Parallel processing of large-scale particle systems on GPU

Original article: [latta04] Lutz Latta, "Massively Parallel Particle Systems on the GPU," ShaderX3, 2004.
Author: Lutz Latta
This article is copyrighted by the original author. It is for personal use only; do not reprint it or use it for any commercial purpose.
Fannyfishblog: http://blog.csdn.net/fannyfishamma@zsws.org

Introduction

The real world is full of small objects that move irregularly. Physically correct particle systems (PS) are designed to simulate these natural phenomena. Over the past few decades, particle systems have been widely used in real-time rendering and in pre-rendered production (film, advertising) to simulate volumetric effects.

Particle systems have a long history in games and computer graphics. As early as the 1960s, games used 2D pixel clouds to simulate explosions. The first computer graphics paper on particle systems, [reeves83], was written after the special effects for the Lucasfilm movie Star Trek II had been created. In it, Reeves described the basic data representation and motion simulation of particles, two concepts that are still in use today. Later, [sims90] presented an implementation for the massively parallel processors of a supercomputer. This article uses many of the velocity and position operations proposed by him and by [mcallister00]. A more recent description of CPU-based particle systems is [burg00].

The performance of a real-time particle system is limited mainly by one of two factors: the fill rate, or the bandwidth of data transfers between the CPU and the graphics hardware (GPU). The fill rate is the number of pixels the GPU can render per frame. It becomes a serious limit when particles are large and many of them overlap.
Since a larger number of smaller particles gives a better simulation, the fill rate matters less and less in particle simulation. The transfer bandwidth between the CPU, which runs the simulation, and the GPU, which renders the result, therefore determines system performance. In a typical game, the graphics bus is shared with other rendering tasks, which limits a CPU-based particle system to about 10,000 particles per frame. To minimize the transfer of particle data, both simulation and rendering can run on the GPU.

Before implementing a GPU-based particle system, two concepts must be understood: stateless (state-independent) and state-preserving (state-related) particle systems. A stateless PS computes particle data solely from a closed-form function of the particle's initial attributes and the current time. A state-preserving PS computes particle data from the attributes of the previous frame and from changes in the environment (such as moving colliders). Each method has its own field of application and is chosen according to the effect to be simulated.

[nvidia01] describes a stateless PS on the first generation of programmable PC GPUs; the next section describes it briefly. With today's floating-point graphics hardware, a state-preserving PS can also be implemented on the GPU. The remaining sections focus on its implementation.

Physical simulation of particles on the GPU lets you flexibly combine multiple velocity and position operations, such as gravity, local forces, and collisions with geometric primitives or height fields. In addition, a parallel sorting algorithm sorts the particles by distance so that alpha blending renders correctly.

State-independent Particle System

[nvidia01] describes how to implement a PS in a vertex shader (also called a vertex program).
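As an illustration of what "closed-form function of the initial attributes and the current time" means, here is a minimal sketch in Python rather than shader code. It assumes the only influence is a constant gravitational acceleration, so the standard formula p = p0 + v0*t + 0.5*g*t^2 applies. The function name, the default gravity value, and the sample numbers are illustrative assumptions, not taken from [nvidia01].

```python
def stateless_position(p0, v0, t, g=(0.0, -9.81, 0.0)):
    """Closed-form position p0 + v0*t + 0.5*g*t^2, evaluated per component.

    Nothing is stored between frames: the result depends only on the
    initial attributes (p0, v0) and the time t.
    """
    return tuple(p + v * t + 0.5 * a * t * t for p, v, a in zip(p0, v0, g))


# Hypothetical particle: starts at the origin, moving sideways and upward.
start = (0.0, 0.0, 0.0)
launch = (1.0, 9.81, 0.0)
for frame_time in (0.0, 1.0, 2.0):
    print(frame_time, stateless_position(start, launch, frame_time))
```

Because every frame re-evaluates the same function, the particle cannot react to anything that was not known when it was born, which is the limitation discussed next.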
This PS is stateless: it does not store attributes such as a particle's current position. The position must instead be computed by a closed-form function of the particle's initial attributes and the current time. A stateless PS can therefore hardly react to a dynamic environment. Particles cannot collide with the environment; they can only be influenced by simple functions such as the gravitational acceleration g:

$$ p = p_0 + v_0 t + \frac{1}{2} g t^2 $$

where p is the computed particle position, p_0 the initial position, v_0 the initial velocity, and t the current time. Adding simple collisions or local forces requires more complex functions.

The rules for particle attributes other than position and velocity (such as orientation, size, and texture coordinates) are simpler. They are usually simple functions of the particle's age. In the system described in the following sections, positions are simulated with state, but these attributes are still computed statelessly.

Its properties make the stateless PS suitable for small, simple effects that do not interact with the environment, such as weapon impacts, splashing water, or sparks in an action game.

Particle Simulation on graphics hardware

The following sections describe the state-preserving particle system on the GPU in detail. They start with an overview of the algorithm and then describe the storage of particle data and the processing steps.

Algorithm Overview

A state-preserving particle system stores the positions and velocities of particles in textures. These textures are also render targets. In one render pass, the velocity texture is updated from the velocities of the previous time step. The update performs one iteration of integration, applying acceleration forces and collision reactions.
Another render pass updates the positions in a similar way, integrating the velocities produced by the previous pass. Depending on the integration method, the velocity pass can be skipped and the position integrated directly from the acceleration. To avoid rendering artifacts, the position texture can optionally be sorted by distance to the camera. The sort takes several additional render passes over a texture that contains the distances. Next, the position texture is converted to a vertex buffer. Finally, the geometry is drawn in the traditional way, as point sprites, triangles, or quads.

The whole algorithm consists of six basic steps, described in the following sections:

1. Process birth and death
2. Update velocities
3. Update positions
4. Sort for alpha blending (optional)
5. Convert texture data to vertex data
6. Render particles

Particle Data Storage

The most important attributes of a particle are its position and velocity. A floating-point texture stores the positions of all particles, with the color components holding the x, y, and z coordinates. Conceptually the texture is a one-dimensional array in which the texture coordinate is the array index. Because hardware limits the size of textures, two-dimensional textures are normally used to represent larger arrays.

The texture is also a render target, so that dynamically computed positions can be written to it. Rendering a full-screen quad makes the GPU call the pixel shader once for every pixel of the render target. The pixel shader reads the position of the previous frame from the position texture. Because a texture cannot be read and rendered to at the same time, double buffering is used: two textures are created, one holding the data of the previous frame and one receiving the new data. A velocity texture can be created in the same way.
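The storage scheme just described can be mimicked on the CPU to make the two ideas concrete: a 1D particle index mapped onto a 2D texture, and two buffers that swap roles every pass. This is only a sketch; nested Python lists stand in for floating-point textures, a Python callable stands in for the pixel shader, and all names (`DoubleBufferedTexture`, `texel_of`, `update`) are hypothetical.

```python
class DoubleBufferedTexture:
    """Two width x height grids of (x, y, z) tuples standing in for float textures."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.front = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]  # read from
        self.back = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]   # render target

    def texel_of(self, index):
        """Map a 1D particle index to 2D texel coordinates (column, row)."""
        return index % self.width, index // self.width

    def update(self, shader):
        """Call `shader` once per texel, reading front and writing back, then swap."""
        for row in range(self.height):
            for col in range(self.width):
                self.back[row][col] = shader(self.front[row][col])
        self.front, self.back = self.back, self.front


positions = DoubleBufferedTexture(4, 2)
positions.update(lambda p: (p[0] + 1.0, p[1], p[2]))  # move every particle along x
col, row = positions.texel_of(6)
print((col, row), positions.front[row][col])
```

The swap at the end of `update` is the whole point of double buffering: the shader never reads from the buffer it is writing to.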
Because velocity needs less precision, its texture can use a 16-bit floating-point format. Depending on the integration method, the velocity may not need to be stored explicitly at all (see the discussion of Verlet integration in the section on updating positions). In that case triple buffering with a third position texture is required.

Figure 2.4.1: Multiple textures store the particle data.

If other particle attributes (such as orientation, size, color, and opacity) are processed by iterative integration, they need double-buffered textures as well. However, these attributes usually follow simple rules or are even static. A method similar to the stateless particle system (see the section on the stateless particle system) can be used: a function of the particle's age describes the attribute, for example an initial value plus a rate of change, or a series of keyframes. Evaluating such a function requires an additional texture that stores two static values per particle: a lifetime-related value and the particle type.

To reduce the number of static attribute parameters that must be uploaded before rendering, particles are assumed to be grouped by particle type. For example, one particle emitter, or one group of particles emitted by it, can form a type.

The particle mass is needed to compute the acceleration caused by external forces. It can be stored in two ways. Either the mass, or a function for it, is uploaded as a particle type parameter, which assumes that all particles of the type have the same mass, or the mass of each particle is stored in the static texture mentioned above.

To sum up, the different textures store different attributes of the same particle at the same texture coordinate. Further attributes can be computed as needed from the particle type parameters.

Processing lifetime

Particles may exist permanently or only for a limited time.
Simulating particles that exist permanently is simplest, because the initial values only need to be written to the attribute textures once. This case is rare, however. The following sections only consider particles with a finite lifetime. The particle system must therefore handle the birth (allocation) and death (deallocation) of particles.

A newly born particle needs a free index in the attribute textures to hold its data. Allocation is inherently a serial problem, and the GPU cannot run an efficient parallel allocation algorithm. A traditional fast allocator on the CPU therefore determines the free indices. The simplest allocation policy pushes all free indices onto a stack. A more elaborate allocator uses an optimized heap structure that always returns the smallest free index. This keeps the live particles packed toward the front of the textures, so that later simulation and rendering steps only need to process that part.

Once the index is known, the data of the new particle is rendered as a pixel into the attribute textures. The CPU can use arbitrarily complex algorithms to determine the initial data, for example different probability distributions for the initial position and velocity.

The death of a particle is handled separately on the CPU and the GPU. The CPU notifies the allocator that the particle is dead and marks its index as free. The GPU uses an additional pass that determines death from the particle's lifetime and age and moves the positions of dead particles to an invisible area, such as infinity. Because particles usually fade out or leave the visible area at the end of their life, this extra pass rarely needs to run. It is essentially a clean-up step that improves efficiency.
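The heap-based allocator described above can be sketched with Python's standard `heapq` module. This is an illustrative CPU-side sketch, not the article's implementation; the class and method names are hypothetical, and `active_count` is an added helper showing how many leading texels later passes would need to touch.

```python
import heapq


class IndexAllocator:
    """Hands out the smallest free particle index; released indices are reused."""

    def __init__(self, capacity):
        self.free = list(range(capacity))  # an ascending list is already a valid min-heap
        self.live = set()

    def allocate(self):
        """Return the smallest free index, or None if the textures are full."""
        if not self.free:
            return None
        index = heapq.heappop(self.free)
        self.live.add(index)
        return index

    def release(self, index):
        """Mark a dead particle's index as free again."""
        if index in self.live:
            self.live.remove(index)
            heapq.heappush(self.free, index)

    def active_count(self):
        """Number of leading texels that may hold live particles."""
        return max(self.live) + 1 if self.live else 0


allocator = IndexAllocator(8)
first = [allocator.allocate() for _ in range(3)]  # indices 0, 1, 2
allocator.release(1)                              # particle 1 dies
reused = allocator.allocate()                     # the gap at index 1 is filled first
print(first, reused, allocator.active_count())
```

A plain stack would be faster per operation but would hand back indices in last-freed order, which scatters live particles across the texture; always returning the smallest index is what keeps them packed at the front.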
Update speed

The first part of the motion simulation updates the particle velocities. Rendering a full-screen quad makes the GPU run the pixel shader once for every pixel of the render target. First, one of the two double-buffered velocity textures is set as the current render target; the velocity of the previous time step is then read from the other texture. Other required data comes from the attribute textures or from constants set before the shader runs.

Several velocity operations can be combined as needed (see [sims90] and [mcallister00]): global forces (such as gravity and wind), local forces (such as attraction and repulsion), velocity damping, and collision reactions. Their parameters are stored in constant registers. Combining operations dynamically is a classic problem in real-time graphics. The solutions used for combining light sources with materials apply here as well, for example processing additional operations in multiple passes.

A global force such as gravity accelerates a particle by a constant amount in a fixed direction. A local force depends on the particle's previous position, which is read from the position texture. A magnet, for example, corresponds to a local acceleration toward a point. The magnitude of the force is inversely proportional to the square of the distance between particle and magnet, or constant within a certain distance (see [mcallister00]). If the acceleration always points toward the point on a line that is closest to the particle, a vortex-like effect results.

Flow-field textures allow more complex local forces. Because texture lookups are cheap on the GPU, mapping 2D or 3D textures that contain flow velocity vectors onto the particle positions is very efficient.
The sampled flow velocity v_fl is inserted into Stokes' law to obtain the drag force on a sphere:

$$ F_{flow} = 6 \pi \eta r \, (v_{fl} - v) $$

where η is the viscosity, r is the radius of the sphere (here, the particle), and v is the velocity of the particle. These constants can be stored in a constant register.

Figure 2.4.2: Multiple forces are combined into one force vector.

The global and local forces are accumulated into a single force vector (see Figure 2.4.2). The acceleration then follows from Newton's second law:

$$ a = \frac{F}{m} $$

where a is the acceleration vector, F the accumulated force, and m the particle mass. If all particles are assumed to have unit mass, this step can be skipped and the force used directly as the acceleration.

Next, the velocity is computed with simple Euler integration:

$$ v = \bar{v} + a \, \Delta t $$

where v is the current velocity of the particle, \bar{v} the velocity of the previous time step, and Δt the time step.

Damping is another velocity operation. It scales down the velocity to imitate the slowing effect of viscous materials or air resistance. It is the special case of the Stokes drag formula in which the flow velocity is zero. The inverse of damping can be used to simulate self-propelled objects, such as a swarm of bees.

Collision is another important operation. The strength of the GPU lies not in simple collisions with planes or spheres but in collisions between particles and terrain described by a height field. The normal vector is computed from three sampled points of the height field, and the reflection vector is derived from the normal. Normal maps can also be used to store the normals.

Note: a shadow-mapping-like algorithm can render the depth values of objects into a texture to generate height fields dynamically. In this way, collisions with complex geometry can be handled with several height fields. For details of this technique, see [kolb04].
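The force accumulation and Euler step of this section can be sketched as follows. This is a CPU-side illustration in Python, not shader code. It assumes Stokes drag in the form 6*pi*eta*r*(v_fl - v), so that the force pulls the particle toward the flow velocity, and an attractor whose strength falls off with the squared distance. All function names and sample values are hypothetical.

```python
import math


def attractor_force(position, center, strength):
    """Force toward `center` whose magnitude is strength / distance^2."""
    offset = [c - p for c, p in zip(center, position)]
    dist_sq = sum(o * o for o in offset)
    if dist_sq == 0.0:
        return (0.0, 0.0, 0.0)  # direction is undefined exactly at the center
    scale = strength / (dist_sq * math.sqrt(dist_sq))  # (strength / d^2) * (1 / d)
    return tuple(scale * o for o in offset)


def stokes_drag(velocity, flow_velocity, viscosity, radius):
    """Stokes drag on a sphere: 6*pi*eta*r*(v_fl - v)."""
    k = 6.0 * math.pi * viscosity * radius
    return tuple(k * (f - v) for v, f in zip(velocity, flow_velocity))


def update_velocity(velocity, forces, mass, dt):
    """Accumulate the forces, apply a = F / m, then one Euler step v = v_prev + a*dt."""
    total = [sum(force[i] for force in forces) for i in range(3)]
    return tuple(v + (f / mass) * dt for v, f in zip(velocity, total))


gravity = (0.0, -9.81, 0.0)  # force on a unit mass
pull = attractor_force((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), strength=4.0)
drag = stokes_drag((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), viscosity=0.01, radius=0.5)
print(update_velocity((1.0, 0.0, 0.0), [gravity, pull, drag], mass=1.0, dt=0.1))
```

Passing a zero flow velocity to `stokes_drag` gives plain damping, which matches the special case mentioned in the text.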
After a collision has been detected, the collision reaction is computed, that is, the velocity after the collision (see [sims90]). First, the current velocity is split into a normal and a tangential component. With n the normal vector at the collision point:

$$ v_n = (v_{bc} \cdot n) \, n, \qquad v_t = v_{bc} - v_n $$

where v_bc is the velocity before the collision, v_n its normal component, and v_t its tangential component. Two material properties act on the two components: the dynamic friction μ reduces the tangential component, and the resilience ε scales the reflected normal component. The velocity after the collision is:

$$ v = (1 - \mu) \, v_t - \varepsilon \, v_n $$

This method has two problems that can cause visible artifacts. First, because the collision is processed after the acceleration forces, the dynamic friction eventually brings the velocity to (very nearly) zero, which leaves particles hanging in the air. The remedy is a threshold: when the velocity is below it, the velocity is no longer reduced.

The second problem is that particles can end up inside a collider. This happens with colliders that have sharp edges (such as a height field) or with two colliders that are very close together. The remedy is to detect such a particle and move it out of the collider. The usual approach performs one integration step to predict the particle's position at the next time step and then tests that position for a collision (see Figure 2.4.3a). The predicted position is:

$$ p = p_{bc} + v \, \Delta t $$

where p_bc is the position at the previous time step.

Figure 2.4.3: Particle collision: a) normal collision detection; b) two collision tests prevent particles from entering the collider.

Testing for collisions twice keeps particles from entering the collider.
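The collision reaction can be sketched in a few lines. This Python illustration assumes a unit normal, splits the velocity into normal and tangential parts, and applies (1 - friction) to the tangential part and -resilience to the normal part. The height-field normal is built from three neighboring height samples, as described for terrain collisions. The `rest_speed` threshold is one possible reading of the threshold mentioned above (friction is switched off below it); all names are hypothetical.

```python
import math


def heightfield_normal(h, h_dx, h_dz, spacing):
    """Unit normal from heights at (x, z), (x + spacing, z) and (x, z + spacing)."""
    n = (-(h_dx - h), spacing, -(h_dz - h))
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


def collision_response(v_bc, n, friction, resilience, rest_speed=0.0):
    """Velocity after hitting a surface with unit normal n."""
    v_dot_n = sum(v * c for v, c in zip(v_bc, n))
    v_n = tuple(v_dot_n * c for c in n)            # component along the normal
    v_t = tuple(v - c for v, c in zip(v_bc, v_n))  # component along the surface
    speed = math.sqrt(sum(v * v for v in v_bc))
    if speed < rest_speed:
        friction = 0.0  # assumption: below the threshold friction no longer slows the particle
    return tuple((1.0 - friction) * t - resilience * c for t, c in zip(v_t, v_n))


flat = heightfield_normal(0.0, 0.0, 0.0, spacing=1.0)   # flat ground, normal points up
bounce = collision_response((1.0, -2.0, 0.0), flat, friction=0.5, resilience=0.5)
print(flat, bounce)
```

With friction and resilience both 0.5, the sideways motion is halved and the downward motion comes back up at half speed.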
The first test uses the position of the previous time step and the second uses the predicted position of the next time step. This makes it possible to distinguish particles that are about to collide from particles that are already inside the collider (see Figure 2.4.3b). The latter are pushed out of the collider, either immediately or at their original speed. The direction of the velocity relative to the normal vector determines how the velocity is computed.

Update location

The second part of the motion simulation updates the particle positions. This section discusses in detail the integration methods mentioned earlier (see the section on particle data storage). When massive amounts of data are processed on the GPU, only simple integration methods are practical. Euler and Verlet integration are suitable for particle simulation. Euler integration was already used in the previous section to compute the velocity from the acceleration. The position is computed from the velocity in the same way:

$$ p = \bar{p} + v \, \Delta t $$

where p is the current position and \bar{p} the previous position.

Verlet integration (see [verlet67]) is even simpler than Euler integration. A particle system that uses Verlet integration does not store the velocity explicitly (see [jakobsen01]); it only keeps the positions of the two previous time steps. The main benefits are reduced memory consumption and the omission of the velocity update pass. Assuming a constant time step, the resulting formula depends only on the acceleration:

$$ p = 2\bar{p} - \bar{\bar{p}} + a \, \Delta t^2 $$

where \bar{\bar{p}} is the position two time steps ago.

Verlet integration handles simple accelerations (global and local) efficiently. Complex velocity operations such as collision reactions are different: they require manipulations of the position that implicitly change the velocity after the collision. An efficient solution is to use position constraints. On a collision, the position is simply moved out of the collider. Because the motion is constrained, the reflected velocity arises implicitly from the integration.
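A Verlet step with a position constraint can be sketched as follows. The sketch uses the standard position Verlet formula p = 2*p_prev - p_prev2 + a*dt^2 with a constant time step. The floor plane used as a collider is a hypothetical stand-in chosen to keep the example short, and the function names are illustrative.

```python
def verlet_step(p_prev, p_prev2, acceleration, dt):
    """Position Verlet: 2*p_prev - p_prev2 + a*dt^2, with no stored velocity."""
    return tuple(2.0 * p1 - p2 + a * dt * dt
                 for p1, p2, a in zip(p_prev, p_prev2, acceleration))


def constrain_to_floor(position, floor_height=0.0):
    """Position constraint: move a particle that sank below the floor back onto it."""
    x, y, z = position
    return (x, max(y, floor_height), z)


gravity = (0.0, -10.0, 0.0)
older, newer = (0.0, 0.05, 0.0), (0.0, 0.05, 0.0)  # particle initially at rest
for _ in range(3):
    # Three positions are alive at once (two inputs, one output): triple buffering.
    predicted = verlet_step(newer, older, gravity, dt=0.1)
    older, newer = newer, constrain_to_floor(predicted)
    print(newer)
```

No velocity is ever written: it is implied by the difference between `newer` and `older`, so clamping the position to the floor also changes the implied velocity, which is how the constraint replaces an explicit collision reaction.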
Constraints between two or more particles can also be added to simulate cloth or hair ([jakobsen01] gives details).

Sort by alpha blending

(Unfinished)

Convert texture data to vertex data

Using texture data as vertex data is a hardware feature that PC GPUs have supported only recently. DirectX exposes vertex textures through vertex shader (VS) 3.0 (see [microsoft02]), and OpenGL through the ARB_vertex_shader extension (see [opengl03]).

To render particles with vertex textures, create a static vertex buffer that stores the texture coordinates of all pixels in the position texture. When this vertex buffer is rendered, the vertex shader reads the particle position from the texture at the coordinate stored in each vertex (see Figure 2.4.5). Current hardware has a high latency for vertex texture reads: a long time passes between issuing the fetch and receiving the texture value. Fortunately, arithmetic that does not depend on the texture value can be executed while waiting, and the vertex shader of the particle system contains many operations that are independent of the particle position (see the section on rendering particles). This compensates for the cost of the texture read.

Figure 2.4.5: Rendering the particles requires texture access in the vertex shader.

Alternatively, "über-buffers" (also called superbuffers, see [mace04]) can store either vertex or pixel data. The concept appeared with the first generation of floating-point GPUs, but it is only supported by the OpenGL API. The OpenGL extension EXT_pixel_buffer_object (see [nvidia04]) provides fast asynchronous copies of data within the GPU. With it, pixel data can be copied into a vertex stream (see Figure 2.4.6), which is also the basic idea behind über-buffers.

Figure 2.4.6: The über-buffer concept copies pixel data to vertex data.
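The static vertex buffer used with vertex textures holds nothing but one texture coordinate per particle. A minimal sketch of how such a buffer could be filled is shown below. It assumes normalized coordinates in [0, 1] that address texel centers; the exact convention (including any half-texel offset) depends on the graphics API, so treat that detail and the function name as assumptions.

```python
def particle_texcoords(width, height):
    """One (u, v) pair per texel of a width x height position texture.

    The order matches a row-by-row walk over the texture, so vertex i
    fetches the position of particle i.
    """
    coords = []
    for row in range(height):
        for col in range(width):
            coords.append(((col + 0.5) / width, (row + 0.5) / height))
    return coords


vertex_buffer = particle_texcoords(2, 2)
for index, uv in enumerate(vertex_buffer):
    print(index, uv)
```

The buffer is static because it never changes after creation; only the position texture it points into is updated every frame.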
A particle can be rendered as a point sprite, a triangle, or a quad (see the section on rendering particles). When vertex textures are used for the data transfer, any geometry type can be used. If each particle needs several vertices, the texture coordinates of each particle can simply be repeated three or four times in the static vertex buffer. Alternatively, the "vertex stream frequency" feature of VS 3.0 can be used. It reduces the rate at which data is fed to the vertex shader: an element of one vertex stream is shared by a run of consecutive vertices, while the other vertex streams advance at their own frequency. The vertex buffer therefore needs no duplicated data for the particles; only the frequency has to be set (3 for triangles, 4 for quads).

On hardware that supports neither vertex textures nor vertex stream frequency, using über-buffers requires the particle positions to be replicated manually before rendering, in a render pass that writes each pixel three or four times. For efficiency, the current implementation renders each particle as a single point sprite.

Rendering Particles

Finally, the transformed vertex positions are used to render primitives to the frame buffer in the traditional way. As mentioned above, point sprites are chosen to reduce the load on the vertex unit. They need only a third or a quarter as many vertices as triangles or quads. The disadvantage of point sprites is that they are always axis-aligned and cannot be rotated in 2D about the screen-space z axis. To overcome this, the texture coordinates can be rotated in the pixel shader to achieve a two-dimensional rotation. Whether point sprites or triangles/quads are the better choice depends on the load a specific application places on the vertex and pixel units.

During rendering, other attributes (such as color and size) are computed from the particle type parameters (see the section on particle data storage).
These attributes are computed from the particle's age or from pseudo-random functions. The current implementation uses the following rules. The particle size is a random value in a range defined by the particle type. The initial angle and the rotation speed (two-dimensional values in screen space) are also random values in ranges defined by the particle type. The color and opacity of a particle are interpolated between four keyframes according to its age. Each particle type can define four keyframes that need not be evenly spaced, so three linear functions must be evaluated on the GPU.

The texture cannot be changed per particle during rendering. Different textures must therefore be combined as sub-textures of one 2D texture, and the appropriate texture coordinates must be selected for each particle. With point sprites, the texture coordinates are generated automatically in the range [0, 1] x [0, 1] during rasterization, so the sub-texture has to be selected in the pixel shader. Fortunately, the texture coordinate transformation already used for the 2D rotation can select the sub-texture for free, because a 2x2 and a 3x2 matrix transformation both take two vector operations.

Conclusion

Of course, today's processors are still not fast enough. On the latest GPUs, a particle texture of size X can be simulated and rendered in real time only with few effects enabled and sorting disabled. With all effects enabled, only 512x512 particles can be processed, at what appears to be 12 frames per second. The next generation of graphics cards can be expected to improve performance.

This article has explained how to design and implement a state-preserving, physically correct particle system on today's programmable graphics hardware.
During the simulation, particle positions can be updated with Euler or Verlet integration, other particle attributes can be processed with simpler algorithms, and these computations are performed only where necessary. An efficient parallel sorting algorithm was also introduced.

The biggest advantage of a GPU-based particle system is that single-step operations on the entire data set are very efficient. Once the basic framework is in place, all kinds of velocity and position algorithms can easily be implemented in HLSL. Whether the state-preserving particle simulation described in this article is suitable for the upcoming generation of video game consoles remains to be seen. In any case, as multi-processor hardware keeps evolving, parallel particle systems will play a significant role.

Acknowledgments

(Omitted.)

References

[batcher68] Batcher, Kenneth E., "Sorting Networks and their Applications," Spring Joint Computer Conference, AFIPS Proceedings, 1968.
[burg00] van der Burg, John, "Building an Advanced Particle System," Game Developer Magazine, 03/2000.
[jakobsen01] Jakobsen, Thomas, "Advanced Character Physics," GDC Proceedings, http://www.gamasutra.com/resource_guide/20030121/jacobson_pfv.htm, 2001.
[kolb04] Kolb, Andreas; Latta, Lutz; Rezk-Salama, Christof, "Hardware-based Simulation and Collision Detection for Large Particle Systems," Graphics Hardware Proceedings, 2004.
[lang03] Lang, Hans W., "Odd-even merge sort," available online.
[mace04] Mace, Rob, "OpenGL ARB Superbuffers," available online at http://www.ati.com/developer/gdc/SuperBuffers.pdf, 2004.
[mark03] Mark, William R.; Glanville, R. Steven; Akeley, Kurt; Kilgard, Mark J., "Cg: A System for Programming Graphics Hardware in a C-like Language," SIGGRAPH Proceedings, 2003.
[mcallister00] McAllister, David K., "The Design of an API for Particle Systems," Technical Report, Department of Computer Science, University of North Carolina at Chapel Hill, 2000.
[microsoft02] Microsoft Corporation, "DirectX 9 SDK," available online at http://msdn.microsoft.com/directx/, 2002-2004.
[nvidia01] NVIDIA Corporation, "NVIDIA SDK," available online at http://developer.nvidia.com/, 2001-2004.
[nvidia04] NVIDIA Corporation, "OpenGL Extension EXT_pixel_buffer_object," available online at http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt, 2004.
[opengl03] OpenGL ARB, "OpenGL Extension ARB_vertex_shader," available online at http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_shader.txt, 2003.
[reeves83] Reeves, William T., "Particle Systems - Technique for Modeling a Class of Fuzzy Objects," SIGGRAPH Proceedings, 1983.
[sims90] Sims, Karl, "Particle Animation and Rendering Using Data Parallel Computation," SIGGRAPH Proceedings, 1990.
[verlet67] Verlet, Loup, "Computer Experiments on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules," Physical Review, 159, 1967.
