While optimizing our rendering engine recently, we ran into a strange problem after adding a pre-Z pass. (In a pre-Z pass the larger objects are drawn first with color writes turned off and only the depth test and depth write enabled, so that pixels hidden behind them are never shaded.) When the same geometry was drawn a second time (both passes go through the same batching code, which uses hardware instancing), some pixels flickered. That means the depth computed in the two passes was not identical. With the symptom understood, we started looking for the cause.
At first we had several guesses:
1. Was precision being lost when the world matrix was uploaded to the GPU, or was there a discrepancy between computing the world-view-projection matrix on the CPU and on the GPU? We tested this by modifying Microsoft's own instancing sample, and the guess turned out to be wrong.
2. We compress vertex positions into 16-bit floating-point numbers; could that be the cause? We again modified the Microsoft sample to check, and this guess was wrong too.
3. Were the two calculations producing different results? Tracing showed that the matrices going into the two multiplications were exactly the same, yet some of the resulting floating-point values really did differ after a few decimal places. At that point we suspected multithreading, so we wrote a test program. The same calculation run on multiple threads gave identical results, which ruled that guess out as well.
4. We also suspected the compiler's floating-point optimizations. Changing the compiler option to /fp:precise gave the same result, so this guess was wrong as well.
5. Was the state of the x87 FPU being modified at some point during the calculation? A colleague helped by searching online for floating-point precision problems under D3D9, and that is how we finally found the cause: D3D9 was playing tricks for the sake of optimization. It seems we still do not know D3D9 well enough. D3D9 takes a set of BehaviorFlags that control how the device is created.
One of those flags is D3DCREATE_FPU_PRESERVE, which the D3D9 documentation explains as follows:
Set the precision for Direct3D floating-point calculations to the precision used by the calling thread. If you do not specify this flag, Direct3D defaults to single-precision round-to-nearest mode for two reasons:
In other words, for the sake of efficiency D3D9 forcibly changes the floating-point control flags of the thread that uses it, so that calculations are carried out in single precision. It gives two reasons:
1. Double-precision mode reduces Direct3D performance.
2. Parts of Direct3D assume that floating-point unit exceptions are masked; unmasking them may result in undefined behavior.
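To see how the same inputs can give different results depending on the precision in effect, the two modes can be emulated portably. The sketch below is only an illustration and not our engine code: the function names are made up, and the volatile float stores merely approximate an x87 FPU that rounds every intermediate to a 24-bit significand, while the second function keeps 53-bit intermediates.

```cpp
#include <cassert>
#include <cstddef>

// Dot product with every intermediate rounded to a 24-bit significand.
// The volatile stores force each partial result to be rounded to float,
// approximating an x87 FPU whose precision control is set to single.
float dot_rounded_to_single(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    volatile float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        volatile float prod = a[i] * b[i];
        acc = acc + prod;
    }
    return acc;
}

// Same dot product with 53-bit intermediates; the result is rounded
// to float only once, at the very end.
float dot_rounded_to_double(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<double>(a[i]) * static_cast<double>(b[i]);
    }
    return static_cast<float>(acc);
}
```

With a = {1, 1, 1} and b = {1e8, 1, -1e8}, the first function returns 0 and the second returns 1, because 1e8 + 1 cannot be represented in 24 bits. A depth value computed both ways can differ in the same manner, which is enough to make a depth test fail.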
The x87 FPU supports three computational precisions: single precision (24 bits), double precision (53 bits), and double extended precision (64 bits). The default precision here is 53-bit double precision. For performance reasons, D3D switches the FPU to single precision. Because FPU state is per-thread, every floating-point operation on the render thread then runs at the precision D3D chose, while threads that D3D has not touched keep the default. The switch shows up in the FPU control register (CTRL), whose value changes from 027F (PC = 10, double precision) to 007F (PC = 00, single precision).
RC field (rounding control, bits 10-11): controls how results are rounded, including floating-point-to-integer conversion
    00 = round to nearest (even)
    01 = round down (toward negative infinity)
    10 = round up (toward positive infinity)
    11 = round toward zero (truncate)
PC field (precision control, bits 8-9):
    00 = single precision
    01 = reserved
    10 = double precision
    11 = extended precision
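The two fields can be pulled out of a control word value with a couple of shifts. This is a minimal sketch for reading values such as 027F and 007F; the type and function names are hypothetical and not part of any Microsoft or Direct3D API.

```cpp
#include <cstdint>

// Encodings of the PC field (bits 8-9) of the x87 control word.
enum class FpuPrecision : std::uint8_t {
    Single24 = 0, Reserved = 1, Double53 = 2, Extended64 = 3
};

// Encodings of the RC field (bits 10-11) of the x87 control word.
enum class FpuRounding : std::uint8_t {
    Nearest = 0, Down = 1, Up = 2, TowardZero = 3
};

struct FpuControlFields {
    FpuPrecision precision;
    FpuRounding rounding;
};

// Extract the PC and RC fields from a 16-bit control word value.
constexpr FpuControlFields decode_control_word(std::uint16_t cw) {
    return { static_cast<FpuPrecision>((cw >> 8) & 0x3u),
             static_cast<FpuRounding>((cw >> 10) & 0x3u) };
}

static_assert(decode_control_word(0x027F).precision == FpuPrecision::Double53,
              "027F selects 53-bit precision");
static_assert(decode_control_word(0x007F).precision == FpuPrecision::Single24,
              "007F selects 24-bit precision");
```

Both values decode to round-to-nearest; only the precision field differs between them.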
Now that the cause is known, we need to pick a good solution:
1. Specify the D3DCREATE_FPU_PRESERVE flag when creating the device, so that D3D9 leaves the FPU in double precision. This reduces D3D9's efficiency.
2. Make the main thread and the render thread both use single-precision floating-point calculation, which can be done with the _controlfp_s function provided by Microsoft. This makes all results consistent, but it may introduce other unexpected precision problems, and another drawback is that the client code has to deal with it. For example, some people have run into problems with Lua in the client.
3. Perform every floating-point operation whose result is passed to the GPU on the render thread. This works in theory but is not realistic in practice.
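For the second option, one possible shape is a small scope guard around the code that builds the matrices. This is a sketch under assumptions rather than what we shipped: ScopedFpuPrecision and PrecisionMode are hypothetical names, while _controlfp_s, _MCW_PC, _PC_24 and _PC_53 are the real CRT identifiers from <float.h>. Precision control only applies to 32-bit x86 builds with MSVC (changing it is not supported on x64), so everywhere else the guard does nothing.

```cpp
#include <cassert>
#if defined(_MSC_VER) && defined(_M_IX86)
#include <float.h>  // _controlfp_s, _MCW_PC, _PC_24, _PC_53
#endif

enum class PrecisionMode { Single24, Double53 };

// Hypothetical helper: sets the x87 precision of the calling thread and
// restores the previous setting when the scope ends.
class ScopedFpuPrecision {
public:
    explicit ScopedFpuPrecision(PrecisionMode mode) {
#if defined(_MSC_VER) && defined(_M_IX86)
        errno_t err = _controlfp_s(&saved_, 0, 0);  // mask 0: read only
        assert(err == 0);
        unsigned int current = 0;
        err = _controlfp_s(&current,
                           mode == PrecisionMode::Single24 ? _PC_24 : _PC_53,
                           _MCW_PC);
        assert(err == 0);
        (void)err;
        active_ = true;
#else
        (void)mode;  // no x87 precision control on this target
#endif
    }

    ~ScopedFpuPrecision() {
#if defined(_MSC_VER) && defined(_M_IX86)
        if (active_) {
            unsigned int current = 0;
            _controlfp_s(&current, saved_ & _MCW_PC, _MCW_PC);
        }
#endif
    }

    ScopedFpuPrecision(const ScopedFpuPrecision&) = delete;
    ScopedFpuPrecision& operator=(const ScopedFpuPrecision&) = delete;

    bool active() const { return active_; }
    unsigned int saved_control_word() const { return saved_; }

private:
    unsigned int saved_ = 0;
    bool active_ = false;
};
```

On the main thread, the matrix code would be wrapped in a block that declares `ScopedFpuPrecision guard(PrecisionMode::Single24);` so that it rounds the same way as the render thread. Any third-party code called inside that scope also runs at reduced precision, which is exactly the kind of side effect mentioned above.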
I prefer the first method, for the reasons explained above. I do not know how you would solve this problem; feel free to leave a comment and discuss.
Reference article:
https://msdn.microsoft.com/en-gb/library/c9676k6h(v=vs.90).aspx
D3D9 floating-point accuracy problems