Test Platform
In my previous blog post, I compared the performance of C# and C++ on compute-intensive programs under VS2010. Link to that post:
http://www.cnblogs.com/ytyt2002ytyt/archive/2011/11/24/2261104.html
Those results were measured on an AMD Phenom X4 9650 CPU (4 cores).
With the release of VS2012 and Intel Parallel Studio XE 2013, this post tests how much the VC11 compiler improves over VC10, and the performance differences between .NET 4.5, .NET 4.0, and C++. Fortran is compiled with the latest Intel Parallel Studio XE 2013. As a long-established scientific computing language, Fortran is also tested here against the mainstream modern languages C++ and C#. Fortran, the earliest compiled programming language, is very convenient for matrix operations and held the performance throne for decades; Fortran 90/95 and Fortran 2003/2008 added a large number of modern language features, including built-in parallel support, starting some 20 years ago.
Test Platform:
CPU: Intel Xeon E3-1230 v2, 3.5 GHz, 4 cores / 8 threads
OS: Windows 7 64-bit
Compilers:
C++: VC11 (VS2012)
Fortran: Intel Parallel Studio XE 2013
C#: .NET 4.0 and .NET 4.5
Test code
To be fair, all of the tests below use only a single thread, with no parallelism and no matrix operations, and everything is compiled with default settings.
The C# and C++ code are the same as in the previous test.
C++ code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <iostream>   // for cin/cout
using namespace std;

#define INTEG_FUNC(x) fabs(sin(x))   // the integrand

int main(void)
{
    unsigned int i, j, N;
    double step, x_i, sum;
    double start, finish, duration;
    double interval_begin = 0.0;
    double interval_end = 2.0 * 3.141592653589793238;

    start = clock();   // start time
    printf("\n");
    printf("Number of       | Computed Integral |\n");
    // printf("Interior Points |\n");
    for (j = 2; j < 27; j++)
    {
        N = 1 << j;
        step = (interval_end - interval_begin) / N;
        sum = INTEG_FUNC(interval_begin) * step / 2.0;
        for (i = 1; i < N; i++)
        {
            x_i = i * step;
            sum += INTEG_FUNC(x_i) * step;
        }
        sum += INTEG_FUNC(interval_end) * step / 2.0;
        // printf("%10d | %14e |\n", N, sum);
        printf("%14e\n", sum);
    }
    finish = clock();   // end time
    duration = finish - start;
    printf("\n");
    printf("time = %10e\n", duration);
    printf("\n");
    int tempA;
    cin >> tempA;   // pause before exit
    return 0;
}
C# code:
using System;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            int time = System.Environment.TickCount;   // start timer
            int i, j, N;
            double step, x_i, sum;
            double interval_begin = 0.0;
            double interval_end = 2.0 * 3.141592653589793238;
            for (j = 2; j < 27; j++)
            {
                N = 1 << j;
                step = (interval_end - interval_begin) / N;
                sum = Math.Abs(Math.Sin(interval_begin)) * step / 2.0;
                for (i = 1; i < N; i++)
                {
                    x_i = i * step;
                    sum += Math.Abs(Math.Sin(x_i)) * step;
                }
                sum += Math.Abs(Math.Sin(interval_end)) * step / 2.0;
                Console.Write(sum.ToString() + "\r\n");
            }
            Console.Write((System.Environment.TickCount - time).ToString());
            Console.ReadLine();
        }
    }
}
Fortran code:
program ForAllProgram
implicit none
real(8) :: time1, time2
integer :: i, j, N
real(8) :: step, x_i, s
real(8) :: interval_begin = 0.0
real(8) :: interval_end = 2.0 * 3.141592653589793238

call CPU_TIME(time1)
do j = 2, 26
    N = 2**j                 ! replaces the C++ bit shift N = 1 << j
    step = (interval_end - interval_begin) / N
    s = Abs(Sin(interval_begin)) * step / 2.0
    do i = 1, N - 1          ! corresponds to i < N in the C++ loop
        x_i = i * step
        s = s + Abs(Sin(x_i)) * step
    end do
    s = s + Abs(Sin(interval_end)) * step / 2.0
    print *, s
end do
call CPU_TIME(time2)
print *, time2 - time1
end program
Note that in Fortran the exponentiation operator (2**j) replaces the C++ bit shift (1 << j), and the do loop running to N-1 corresponds to the C++ loop condition i < N.
Test Results
Time unit: milliseconds; smaller is better. (The results chart is an image in the original post.)
Test conclusion
C#: compared with .NET 4.0, .NET 4.5 improves performance only slightly, and only in the 32-bit build. Oddly, under .NET 4.5 the 32-bit build outperforms the 64-bit build.
C++: VS2012 improves significantly over VS2010; Microsoft's C++ compiler may now be close to Intel's C++ in performance, and the 64-bit build is clearly faster than the 32-bit build.
On compute-intensive problems, Fortran's performance is astonishing, even beyond my expectations: with no optimization work at all, it is more than 3 times faster than C++ and 5-6 times faster than C#. Its reputation as the leader of numerical computing is well deserved. This high performance is probably because Fortran can fully exploit SIMD vectorization by default (the AVX instruction set on this machine). C++, even with Intel's vectorizing compilation enabled (it is on by default in the Intel compiler), struggles to achieve full automatic vectorization because of its complicated syntax; you need to add vectorization directives such as #pragma simd, or even hand-code vector intrinsics (as in the optimized paths in OpenCV). That greatly increases the workload and complexity of program optimization.
It follows that for large-scale scientific computing, Fortran is still the most suitable choice. In addition, a large number of existing mathematical libraries are written in Fortran, and its syntax is relatively simple; it really is a perfect match for numerical computing.
C++ has an inherent advantage in interacting with the underlying system, while C# is the most convenient and elegant choice for presentation-layer development and overall architecture design.
Outlook
The next article will continue with tests of CPU parallelism and GPU acceleration. Based on past experience, a GTX 460 can reach 10-20 times the performance of a single CPU thread in single-precision (float) computing after optimization. But considering multi-core CPU parallelism and Fortran's performance, I estimate the GPU's final advantage will not be that large, perhaps only 2-3 times. For double precision, desktop graphics cards run at only 1/8 of their single-precision rate (Tesla compute cards reach 1/2 but are expensive; the latest Kepler GK110-based Tesla K20 and the Titan reach 1/3, with theoretical double-precision throughput above 1 TFLOPS), so I estimate a desktop card's double precision will only reach 2-3 times an 8-thread parallel CPU, with Kepler perhaps higher. But this is just speculation; we won't know until the next test.
Original post: Yang Tao's learning memorandum, http://www.cnblogs.com/ytyt2002ytyt/archive/2013/04/02/2996718.html