Original: "A trip through the Graphics Pipeline 2011"
Translation: Sword of the past
If you reprint this, please credit the source.

Welcome back. This time we will finally look at how triangles are rasterized. But before we can rasterize triangles we need to do triangle setup, and before I can discuss triangle setup I need to explain what we are setting things up for. In other words, let's talk about hardware-friendly triangle rasterization algorithms.
How to draw a triangle

First, a heads-up for readers who have written their own optimized software texture mappers. You are probably used to thinking of the triangle rasterizer as one block that does a bunch of things at once: it traces the triangle's shape, interpolates the u and v coordinates (or, for perspective-correct mapping, u/z, v/z and 1/z), performs the Z-buffer test (for perspective-correct mapping you may have used a 1/z buffer instead), and then does the actual texturing (and shading), all in one loop that uses every available register. Hardware does not work that way. In hardware, things are packaged into neat little modules that are easy to design and test in isolation. The "triangle rasterizer" in hardware is the block that tells you which pixels a triangle covers; in some cases it also gives you the barycentric coordinates of those pixels within the triangle. But that is all. It does not give you u and v, not even 1/z. And it certainly does no texturing or shading, which should come as no surprise given that there are dedicated texture and shader units.

Second, if you have written your own triangle mapper, you probably used an incremental scanline rasterizer like the one in Chris Hecker's series on perspective texture mapping. That is a great approach on processors without SIMD units, but it maps poorly to modern processors with fast SIMD units, and even worse to hardware. Not that this stopped people from trying: there is a certain dated game console standing in the corner right now, trying hard to look uninterested. It is the one whose triangle rasterizer had very fast guard-band clipping on the bottom and right edges of the screen, and not so fast guard-band clipping on the top and left edges. Just saying.

So, what is wrong with this algorithm for hardware? First, it really does rasterize triangles one scan line at a time. For reasons that become clear once we get to pixel shading, we want the rasterizer to output groups of 2x2 pixels (so-called "quads", not to be confused with the "quad" primitive, which has already been decomposed into a pair of triangles at this point in the pipeline). That is awkward with the scanline algorithm: not only do we need to run two instances of it in parallel, each instance also starts at the first covered pixel on its own scan line, and those two pixels may be far apart, so we do not naturally get the 2x2 quads we want. The algorithm is also hard to parallelize efficiently, and it is not symmetric in x and y, which means a triangle 8 pixels wide and 100 pixels high stresses very different parts of the rasterizer than one 100 pixels wide and 8 pixels high. That is annoying, because now the "x" and "y" stepping loops both have to be equally fast to avoid bottlenecks, whereas if all the work were done in the "y" steps, the "x" loop would be trivial. In short, it is a mess.
A better way

A much simpler (and more hardware-friendly) way to rasterize triangles was described in a 1988 paper by Pineda. The approach can be summed up in two sentences. The signed distance to a line can be computed with a 2D dot product (a few multiplies and adds), just as the signed distance to a plane can be computed with a 3D dot product. And the interior of a triangle can be defined as the set of all points that lie on the correct side of all three edges. So just loop over all candidate pixels and test whether they are inside the triangle. That is the basic algorithm.

Note that when we move, say, one pixel to the right, we add one to x and leave y unchanged. Our edge equations have the form E(x, y) = a*x + b*y + c, where a, b and c are per-triangle constants, so for x + 1 we get E(x + 1, y) = a*(x + 1) + b*y + c = E(x, y) + a. In other words, once you know the value of an edge equation at one point, its value at a neighboring pixel is just an addition away. Also note that this is trivial to parallelize. Say you want to rasterize 8x8 = 64 pixels at once, as AMD hardware does (or at least the Xbox 360, according to the third edition of "Real-Time Rendering"). You compute the offsets i*a + j*b for 0 <= i, j <= 7 once per triangle (and per edge) and keep them in registers. Then, to rasterize an 8x8 block of pixels, you evaluate the three edge equations at the block's top-left corner, fire off 8x8 parallel adders to add the precomputed offsets, and test the sign bits of the results to see whether each of the 8x8 pixels is inside or outside that edge. Do this for all three edges and you have rasterized one 8x8 block of a triangle in a thoroughly parallel fashion, using nothing more complicated than a bunch of integer adders. This is also why vertices were snapped to a fixed-point grid in the previous part: it lets us use integer arithmetic here. Integer adders are much simpler than floating-point arithmetic units.
We can also choose the width of the adders to be just enough to support the viewport sizes we want with sufficient subpixel precision, plus some headroom (a small multiple of the viewport size) for a reasonably sized guard band.

Incidentally, there is another thorny point here: fill rules. You need tie-breaking rules which ensure that for any pair of triangles sharing an edge, no pixel near that edge is skipped or rasterized twice. D3D and OpenGL both use the so-called "top-left" fill rule; the details are explained in their respective documentation. I won't go over it here, except to note that with this kind of integer rasterizer it boils down to subtracting 1 from the constant term of some edges during triangle setup. That guarantees watertight results without any fuss; compare this with what Chris has to go through in his articles to get it right. Sometimes things just come together nicely.

We still have a problem, though: how do we find out which 8x8 blocks to test? Pineda proposes two strategies: 1) simply scan the whole bounding box of the triangle, or 2) a smarter scheme that turns around as soon as it stops hitting triangle samples. That is fine if you test one pixel at a time. But we are now processing 8x8 pixels at once! Doing 64 parallel adds only to discover at the very end that not a single pixel was hit is a lot of wasted work. So don't do that.
What we need here is more hierarchy

What I have just described is the "fine" rasterizer (the one that actually outputs sample coverage). To avoid wasted work at the pixel level, we put another rasterizer in front of it that does not rasterize the triangle into pixels but into "tiles", our 8x8 blocks (the paper by McCormack and McNamara has some details, as does Greene's "Hierarchical Polygon Tiling with Coverage Masks", which builds on the same idea). Rasterizing edge equations into covered tiles works much like rasterizing pixels. What we do is compute lower and upper bounds for the edge equations over a whole tile. Because the edge equations are linear, these extrema occur on the tile's boundary; in fact, it is enough to look at the 4 corner points, and the signs of the a and b coefficients of the edge equation tell us which corner to use. The bottom line is that this is not much more expensive than what we already have, and it needs the same machinery: a few parallel integer adders. As a bonus, since we evaluate the edge equations at a corner of the tile anyway, we may as well pass that value on to the fine rasterizer: it needs one reference value per 8x8 block, remember?

So we first run a "coarse" rasterizer that tells us which tiles might be covered by the triangle. This rasterizer can be made smaller (8x8 at this level is more than enough), and it does not need to be as fast (because it only runs once per 8x8 block). In other words, at this level the cost of discovering empty blocks is correspondingly lower.

You can take this idea further, as in Greene's paper or Mike Abrash's "Rasterization on Larrabee", and build a fully hierarchical rasterizer. For a hardware rasterizer there is little point, though: it actually adds work for small triangles (unless you can skip hierarchy levels, but that is not how hardware data flows are designed), and for triangles large enough to cause significant rasterization work, the architecture described here already generates pixel locations faster than the shader units can consume them.
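The corner selection for the coarse test can be sketched in a few lines of Python. This is a simplified illustration, not a description of any particular chip: coordinates are in whole pixels with no sub-pixel bits, the interior is assumed to have positive edge values, and the function names are hypothetical.

```python
TILE = 8  # tile size in pixels

def edge(p0, p1):
    """Return (a, b, c) with E(x, y) = a*x + b*y + c zero on the line p0 -> p1."""
    (x0, y0), (x1, y1) = p0, p1
    return y0 - y1, x1 - x0, x0 * y1 - x1 * y0

def max_corner(a, b):
    """Offset of the tile corner where E is largest, chosen from the signs of a and b."""
    return (TILE if a > 0 else 0, TILE if b > 0 else 0)

def tile_may_be_covered(edges, tx, ty):
    """Conservative test for the tile whose top-left corner is pixel (tx, ty)."""
    for a, b, c in edges:
        ox, oy = max_corner(a, b)
        if a * (tx + ox) + b * (ty + oy) + c < 0:
            return False  # even the best corner is outside this edge
    return True

tri = [edge((0, 0), (40, 0)), edge((40, 0), (0, 40)), edge((0, 40), (0, 0))]
candidates = [(tx, ty) for ty in range(0, 40, TILE) for tx in range(0, 40, TILE)
              if tile_may_be_covered(tri, tx, ty)]
```

Since the corner offsets depend only on the signs of a and b, they can be chosen once per edge during setup. The test is conservative: a tile that passes may still turn out to contain no covered pixels.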
In fact, large triangles are not the real problem: pretty much any algorithm (certainly including scanline rasterizers) handles them efficiently. The problem is small triangles. Even for a bunch of tiny triangles that each produce 0 or 1 visible pixels, you still have to go through triangle setup (which I am about to describe), at least one coarse rasterization step, and at least one fine rasterization step for an 8x8 block. With tiny triangles it is easy to become bound by triangle setup or coarse rasterization.

Note also that this kind of algorithm is expensive for slivers (long, thin triangles): you have to traverse a lot of tiles and get very few covered pixels out of each. So slivers are slow; avoid them whenever you can.
What does triangle setup do?

Now that I have described the rasterization algorithm, we only need to look at which per-edge constants it uses; those are exactly what triangle setup has to compute:
- The edge equations: a, b and c for each of the three triangle edges.
- Some of the derived values mentioned earlier, such as the per-pixel offsets for an 8x8 block. Note that you would not actually store a full 8x8 matrix of these in hardware, certainly not if you are going to add another value to each of them anyway. The best approach in hardware is probably to compute just the row and column terms, use a carry-save adder (also known as a 3:2 reducer, which I have written about before) to reduce the three-term sum to two terms, and then finish with a regular adder.
- Which reference corner of the tile to use to get the upper and lower bounds of each edge equation for coarse rasterization.
- The initial value of each edge equation at the first reference point of the coarse rasterizer (adjusted for the fill rule).
That is what triangle setup computes. It boils down to several large integer multiplications for the edge equations and their initial values, a few smaller multiplications for the step values, and some cheap combinational logic for the rest.
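The carry-save step from the list above can be modelled in software. The sketch below is only meant to show the arithmetic identity; it relies on Python integers behaving like two's-complement words of unlimited width, and the function names are invented for this example.

```python
def carry_save(x, y, z):
    """3:2 reduction: two words whose sum equals x + y + z, with no carry chain."""
    partial = x ^ y ^ z                           # per-bit sum, carries ignored
    carries = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carries, moved up one place
    return partial, carries

def edge_value(a, b, c, i, j):
    """E at block offset (i, j): reduce i*a + j*b + c to two terms, then do one real add."""
    partial, carries = carry_save(i * a, j * b, c)
    return partial + carries

# Example coefficients (arbitrary values chosen for illustration).
a, b, c = -120, 35, 977
values = [[edge_value(a, b, c, i, j) for i in range(8)] for j in range(8)]
```

Each output bit of the reduction depends on only three input bits, so the only carry propagation left is in the single final addition.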
Other rasterization issues and pixel output

One thing I have not mentioned so far is the scissor rect. This is just a screen-aligned rectangle that masks pixels: the rasterizer generates no pixels outside it. It is fairly easy to implement. The coarse rasterizer can simply reject tiles that do not overlap the scissor rect, and the fine rasterizer ANDs every coverage mask it generates with the "rasterized" scissor rect ("rasterization" here boils down to one integer comparison per row and column plus some bitwise ANDs).

Another issue is multisample antialiasing. What changes is that you now have to test multiple samples per pixel; DX11 hardware has to support at least 8x MSAA. Note that the sample positions within each pixel are not on a regular grid (which behaves badly for near-horizontal or near-vertical edges) but are spread out to give good results over a wide range of edge orientations. Such irregular sample positions are a real pain for scanline rasterizers (another reason not to use them!) but very easy to support in a Pineda-style algorithm: triangle setup computes a few more per-edge offsets, and each pixel gets several additions and sign tests instead of just one.

For 4x MSAA, for example, an 8x8 rasterizer has two options. You can treat each sample as a separate "pixel", which means the effective tile size is 4x4 actual screen pixels and each 2x2 group of positions in the fine rasterizer corresponds to one pixel; or you can stick with 8x8 actual pixels and run through them 4 times. 8x8 seems a bit large to me, so I assume AMD does the former. Other MSAA levels work analogously.

Anyway, we now have a fine rasterizer that gives us the location of each 8x8 block plus a coverage mask for that block.
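Both features from this section come down to integer compares, adds and ANDs, which the following Python sketch tries to show. It is a sketch under assumptions: the 64-bit mask layout, the exclusive upper bounds of the scissor rect, the 4-bit sub-pixel precision and the 4x sample pattern in `SAMPLES` are all illustrative choices of mine, not values taken from a specification.

```python
N = 8  # block size in pixels

# Hypothetical 4x sample pattern: offsets from the pixel centre in 1/16 pixel.
SAMPLES = [(-2, -6), (6, -2), (-6, 2), (2, 6)]

def scissor_mask(bx, by, rect):
    """64-bit mask for the block at pixel (bx, by); bit j*N + i is pixel (i, j).
    rect = (x0, y0, x1, y1) with exclusive upper bounds."""
    x0, y0, x1, y1 = rect
    cols = sum(1 << i for i in range(N) if x0 <= bx + i < x1)  # compares per column
    mask = 0
    for j in range(N):
        if y0 <= by + j < y1:                                  # compares per row
            mask |= cols << (N * j)
    return mask

def sample_offsets(a, b):
    """Per-edge setup: change in E for each sample relative to the pixel centre."""
    return [a * dx + b * dy for dx, dy in SAMPLES]

def sample_mask(edges, cx, cy):
    """4-bit coverage of the pixel whose centre is at fixed-point (cx, cy)."""
    mask = 0b1111
    for a, b, c in edges:
        centre = a * cx + b * cy + c
        for s, off in enumerate(sample_offsets(a, b)):
            if centre + off < 0:   # one add and one sign test per sample
                mask &= ~(1 << s)
    return mask

full = (1 << (N * N)) - 1
clipped = full & scissor_mask(0, 0, (2, 3, 6, 5))

# Edge equations (a, b, c) of the triangle (0,0), (118,0), (0,118) in 1/16 pixel units.
edges = [(0, 118, 0), (-118, -118, 118 * 118), (118, 0, 0)]
partial = sample_mask(edges, 56, 56)  # pixel (3, 3), crossed by the diagonal edge
```

With this pattern, pixel (3, 3) keeps three of its four samples, while the scissor example keeps a 4x2 rectangle of pixels out of the full block.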
A coverage mask per block is only half the story, though. Current hardware also performs early Z and hierarchical Z tests before running pixel shaders, and the Z processing is interleaved with the actual rasterization. It is easier to explain the two separately, so in the next part I will talk about the various kinds of Z processing, Z comparisons, and some more triangle setup: so far we have only covered setup for rasterization, but there are also interpolated quantities for Z and pixel shading, and they need to be set up too.
Caveats

I have linked to a few rasterization algorithms that I consider representative (all of them are available on the web). There are many more. I did not even try to give a comprehensive introduction here; that would be a long series of its own, and I fear a rather dry one.

This article also assumes high-end PC hardware. Many parts, particularly in the mobile/embedded space, are so-called tile renderers, which divide the screen into tiles and render each one separately. These tiles are not the same as the 8x8 rasterization tiles I have been talking about. A tile-based renderer needs at least one additional, very coarse rasterization stage that runs early and determines which of the (large) tiles each triangle covers; this stage is usually called "binning". Tile-based renderers work differently from the "sort-last" architecture described here and have different design parameters. Once I am done with the D3D11 pipeline I may write a post or two about tile-based renderers (if there is interest), but for now I am ignoring them, so be aware that, for example, the PowerVR chips commonly found in smartphones handle some of this differently.

With 8x8 blocks (other block sizes have the same problem), triangles below a certain size, or with an unfavorable aspect ratio, need a lot more rasterization work than you would expect and make poor use of the hardware in the process. I would love to tell you about a magic algorithm that is easy to parallelize and also handles slivers well, but I do not know of one, and judging by the hardware vendors' regular reminders that slivers are bad, neither do they. So for now this is just a fact of life with hardware rasterization. Maybe someone will come up with a good solution eventually.

The edge-equation lower bound I described works well for coarse rasterization, but in some cases it produces false positives (that is, it requests fine rasterization for blocks that do not actually cover any pixels). There are tricks to reduce this, but detecting these special cases is often more expensive than simply rasterizing the occasional block that covers no pixels. It is another trade-off.

Finally, the blocks used during rasterization are usually snapped to a grid (the next part explains this in more detail). In that case, a triangle that covers just two pixels but straddles two tiles makes you rasterize two 8x8 blocks. That is inefficient.

The bottom line: all of this is simple as an algorithm, but it is not perfect, and actual triangle rasterization does not reach the theoretical peak rate (which always assumes that every block is completely filled). Please keep that in mind.
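To put a number on the grid-alignment caveat, here is a tiny Python calculation. The helper names are invented for this example, and "utilization" here simply means covered pixels divided by the pixels in all blocks that had to be rasterized.

```python
def blocks_touched(pixels, n=8):
    """Grid-aligned n x n blocks that contain at least one of the given pixels."""
    return {(x // n, y // n) for x, y in pixels}

def utilization(pixels, n=8):
    """Covered pixels divided by the pixels in all blocks that must be rasterized."""
    return len(set(pixels)) / (len(blocks_touched(pixels, n)) * n * n)

# Two neighbouring pixels that happen to straddle a block boundary.
straddling = [(7, 3), (8, 3)]
ratio = utilization(straddling)
```

Here two blocks (128 pixel slots) are rasterized to produce 2 pixels, a utilization of about 1.6%, compared with the 100% that peak rate figures assume.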
Graphics Pipeline Tour Part6