Basic questions

Imagine that an artist makes a 3D model, and a rendering engine renders that model to the screen. We can also select different viewing angles and simulate different lighting conditions to observe the model. Now let's analyze this process. If we think of rendering as a function, its output is the image on the screen, or more precisely, the value of every pixel on the screen. The main inputs are the 3D model, the viewpoint we observe from, the lighting conditions, and so on. Rendering is the process by which these factors determine each pixel's value.

First, look at the model. A model is usually made in visual modeling software and looks like a solid "entity". From the computer's perspective, however, a model file is simply a file containing the data needed to render it. The details of real-world objects are extremely complex and cannot be described exhaustively, so we simplify: approximate the object as a polygon mesh, and further restrict each face to a triangle. (Where the surface has more curvature and detail, we can fit it with more, smaller triangles.) Clearly, the relative position of each vertex is important: it determines the shape of the model. Because positions must be described, the problem of choosing a coordinate space arises naturally. The space in which a model describes its own vertex coordinates is called model space. We can therefore conclude that the model file must contain vertex coordinates in model space.

Typically a rendered scene contains multiple objects. They are placed at different positions, which of course make up different scenes. To describe the positions of models relative to one another in the scene, we choose a coordinate space independent of any model, called world space. With world space we can also describe the observer's position and viewing direction. This observer is usually called a camera, and the rendered result can be viewed as an image taken by the camera. To determine the outcome of the observation, we obviously need to know the camera's position and orientation. Another point to note is that, depending on the projection method, there are two types of cameras: orthographic and perspective. With an orthographic camera it is impossible to judge an object's distance from the camera; there is no foreshortening. The perspective camera matches how the human eye actually sees: the farther an object is from the camera, the smaller its image becomes.

The problem now is: given a point's position in world space, and the camera's position and orientation, where on the final image is this point rendered? (It may not be visible, in which case it should get an out-of-range coordinate.) This problem leads to the coordinate transformations constantly encountered in shader programming. The first thing we care about is the position of the point relative to the camera, so naturally we can create a coordinate space based on the camera to describe the thing being observed. This space is called view space. Following OpenGL convention, we take the camera's right as the +x axis, up as the +y axis, and the viewing direction as the -z axis, making view space a right-handed system; for objects observed by the camera, the farther away an object is, the smaller (more negative) its z. The question now is how to transform coordinates from world space into view space. Obviously, this is an affine transformation, consisting of a linear transformation plus a translation. A 3x3 matrix can only describe linear transformations, so for uniformity homogeneous coordinates are introduced: extend the linear transformation's matrix to a 4x4 matrix, adding a column (or row, depending on whether column or row vectors are used) to encode the translation. Now the affine transformation between any two three-dimensional coordinate spaces can be described by a single 4x4 matrix.
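As a minimal sketch of how homogeneous coordinates let one 4x4 matrix carry both rotation and translation, here is a plain-Python example (helper names are illustrative, not from any particular engine):

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (row-major nested lists) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# World -> view: apply the inverse of the camera's placement.  Here the
# camera sits at world position (0, 0, 5) with no rotation, looking down
# -z, so the view matrix is simply a translation by (0, 0, -5), stored
# in the last column -- something a 3x3 linear matrix could not express.
view = [
    [1, 0, 0,  0],
    [0, 1, 0,  0],
    [0, 0, 1, -5],
    [0, 0, 0,  1],
]

# A point at the world origin, written as a homogeneous column vector (w = 1).
p_world = [0, 0, 0, 1]
p_view = mat_vec(view, p_world)
print(p_view)  # [0, 0, -5, 1]: five units in front of the camera (negative z)
```

Translation enters the multiply only because the vector carries the extra w = 1 component; a direction vector with w = 0 would be rotated but not translated by the same matrix.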

Given the coordinates of a vertex in view space, we now need to determine where the point projects on the screen. First we transform the coordinate into a left-handed clip space, so that after normalization the point's coordinates are three-dimensional, and any component beyond the range -1 to 1 is outside the visible range. (This is the OpenGL convention; in DirectX, z ranges from 0 to 1.)

The visible range of the orthographic camera is the set of points whose coordinates satisfy the constraints

-far <= z <= -near,

-size <= y <= size,

-size*aspect <= x <= size*aspect.

Here size and aspect determine the rectangular extent the camera sees, and near and far determine the depth range. Now we need to map the boundaries of all three components to -1 and 1 through a transformation. To flip the handedness, we let -far map to 1 and -near map to -1.

Solving the equations

-k * far + b = 1,

-k * near + b = -1,

gives the coefficients k = -2/(far - near), b = (near + far)/(near - far).

Then the transformation matrix follows.
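The original figure showing the matrix did not survive; for reference, the standard OpenGL-style orthographic projection matrix consistent with the constraints and coefficients above (x and y scaled into [-1, 1], z mapped with the k and b just derived) is:

```latex
M_{\text{ortho}} =
\begin{pmatrix}
\dfrac{1}{aspect \cdot size} & 0 & 0 & 0 \\[2ex]
0 & \dfrac{1}{size} & 0 & 0 \\[2ex]
0 & 0 & \dfrac{-2}{far - near} & \dfrac{near + far}{near - far} \\[2ex]
0 & 0 & 0 & 1
\end{pmatrix}
```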

For a perspective camera, the visible range is a truncated four-sided pyramid, the view frustum. Define the camera's vertical field-of-view angle as FOV, and the ratio of horizontal to vertical extent as aspect. Its visibility constraints are (note that z is negative):

tan(FOV/2) * z * aspect <= x <= -tan(FOV/2) * z * aspect,

tan(FOV/2) * z <= y <= -tan(FOV/2) * z,

-far <= z <= -near.

Here, because the bounds on x and y depend on z, we need to use the w component for the homogeneous divide, and the third element of the matrix's last row should be -1, so that after the transformation the w component equals the negation of the original z.

Similar to the above, we let -far map to far and -near map to -near, so that after dividing by w = -z they become 1 and -1.

Solving the equations

k * (-far) + b = far,

k * (-near) + b = -near,

gives k = (far + near)/(near - far), b = 2*far*near/(near - far).
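A quick numeric sanity check of these coefficients (plain Python; the near and far values are arbitrary examples):

```python
# Verify the perspective z-mapping coefficients derived above.
near, far = 1.0, 100.0
k = (far + near) / (near - far)
b = 2 * far * near / (near - far)

# Before the homogeneous divide, z' = k*z + b.  The derivation demands
# that z = -far maps to +far and z = -near maps to -near, so that after
# dividing by w = -z they become +1 and -1 respectively.
assert abs(k * (-far) + b - far) < 1e-9
assert abs(k * (-near) + b + near) < 1e-9
print("z' at z = -far :", k * (-far) + b)   # 100.0
print("z' at z = -near:", k * (-near) + b)  # -1.0
```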

Thus the transformation matrix of the perspective camera is obtained.
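The matrix figure is again missing from the original; the standard OpenGL-style perspective projection matrix matching the frustum constraints and the k and b derived above (with -1 in the third element of the last row, so w = -z after the transform) is:

```latex
M_{\text{persp}} =
\begin{pmatrix}
\dfrac{1}{aspect \cdot \tan(FOV/2)} & 0 & 0 & 0 \\[2ex]
0 & \dfrac{1}{\tan(FOV/2)} & 0 & 0 \\[2ex]
0 & 0 & \dfrac{far + near}{near - far} & \dfrac{2 \cdot far \cdot near}{near - far} \\[2ex]
0 & 0 & -1 & 0
\end{pmatrix}
```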

For visible points, multiplying the column vector of coordinates by the transformation matrix on the left and then performing the homogeneous divide by the w component yields normalized coordinates (x, y, z), each within the range -1 to 1. The final rendered pixel coordinates are obtained by mapping x and y through (x + 1)/2 and (y + 1)/2 and multiplying by the screen's pixel width and height respectively. (OpenGL's screen coordinate origin is in the lower-left corner.) The z value is the depth value of this point.
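The final mapping described above can be sketched in plain Python (function and variable names are illustrative, assuming the OpenGL lower-left origin):

```python
def clip_to_screen(clip, width, height):
    """Map clip-space (x, y, z, w) to (pixel_x, pixel_y, depth)."""
    x, y, z, w = clip
    # Homogeneous divide yields normalized device coordinates in [-1, 1].
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    # Map [-1, 1] to [0, width] and [0, height]; depth is kept as NDC z.
    px = (ndc_x + 1) / 2 * width
    py = (ndc_y + 1) / 2 * height
    return px, py, ndc_z

# A point at the exact center of the view (clip x = y = 0) lands in the
# middle of an 800x600 screen.
print(clip_to_screen((0.0, 0.0, 1.0, 1.0), 800, 600))  # (400.0, 300.0, 1.0)
```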

Now let's summarize what we need to know to complete this series of transformations.

Vertex coordinates in model space -----> vertex coordinates in view space: this step requires knowing the relative position and orientation of the model and the camera, in order to build the transformation matrix from model space to view space. Obviously, once the model and camera have been placed in the scene, this matrix is naturally known.

Vertex coordinates in view space -----> clip space: this step only needs the camera-related parameters: the depth range near and far, the size of the orthographic camera (or the FOV of the perspective camera), and the width-to-height ratio aspect.

Clip space -----> screen pixel coordinates: only the screen's pixel width and height need to be specified.

Unity Shader Learning Note (i) Coordinate transformation