I had been meaning to try this for a long time and finally felt like doing it today. The code is not finished; I will fill in the rest later.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <Windows.h>

#define M 1024

float MatA[M][M];
float MatB[M][M];
float MatC[M][M];

/* Fill an M x M matrix (passed as a flat array) with pseudo-random values. */
void InitMatrix(float *matrixX)
{
    register int i;
    for (i = 0; i < M * M; i++)
    {
        /* The modulus and divisor were lost from the original listing;
           1000 and 100.0f are placeholder assumptions. */
        *matrixX++ = (float)(rand() % 1000) / 100.0f;
    }
}

/* matrixC = matrixA * matrixB, all M x M, row-major. */
void MulMatrix(float *matrixA, float *matrixB, float *matrixC)
{
    register int i, j, k;
    register float *p, *q, f;
    for (j = 0; j < M; j++)
    {
        for (i = 0; i < M; i++)
        {
            p = matrixA + j * M;   /* start of row j of A */
            q = matrixB + i;       /* top of column i of B */
            f = 0;
            for (k = 0; k < M; k++)
            {
                f += *p * *q;
                p++;               /* next element in the row of A */
                q += M;            /* next element in the column of B */
            }
            matrixC[j * M + i] = f;
        }
    }
}

int main()
{
    DWORD t;
    //register int i, j;

    srand((unsigned int)time(NULL));

    InitMatrix((float *)MatA);
    InitMatrix((float *)MatB);

    t = ::GetTickCount();
    MulMatrix((float *)MatA, (float *)MatB, (float *)MatC);
    t = ::GetTickCount() - t;

    /*for (j = 0; j < M; j++)
    {
        for (i = 0; i < M; i++)
        {
            printf("%.2f ", MatC[j][i]);
        }
        printf("\n");
    }*/

    printf("time: %lu\n", t);   /* DWORD is an unsigned long */

    return 0;
}
Machine configuration: E3-1231 v3, 16 GB RAM, VS2010 SP1, ICC 2015 XE, GTX 660. Later I will add CUDA and test everything together.
1. CPU, single thread, compiled with /O2 only
About 4750 ms in general.
I measured multithreading some time ago, but that code is not included this time. Counting the 4 physical cores, it should take about 6 seconds. With Hyper-Threading I estimate it would be somewhat better, probably about 5 seconds.
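Since the multithreaded code is not included above, here is a minimal sketch of one way it could look, assuming OpenMP is used. The post does not say how the threading was actually done, so the function name mul_matrix_omp and the extra parameter n (the matrix size, M in the listing) are hypothetical. The option that enables OpenMP is /openmp for the VS compiler and /Qopenmp for ICC on Windows; without it the pragma is ignored and the loop runs on one thread.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original post: the same triple loop as
   MulMatrix, with the outer row loop divided among threads by OpenMP. */
void mul_matrix_omp(const float *a, const float *b, float *c, int n)
{
    int j;
    /* Every iteration of j writes a different row of c, so the rows are
       independent and can be computed in parallel. */
    #pragma omp parallel for
    for (j = 0; j < n; j++)
    {
        int i, k;
        for (i = 0; i < n; i++)
        {
            const float *p = a + (size_t)j * n;  /* row j of a */
            const float *q = b + i;              /* column i of b */
            float f = 0.0f;
            for (k = 0; k < n; k++)
            {
                f += *p * *q;
                p++;
                q += n;
            }
            c[(size_t)j * n + i] = f;
        }
    }
}
```

It would be called like MulMatrix, for example mul_matrix_omp((float *)MatA, (float *)MatB, (float *)MatC, M). The variables i, k, p, q and f are declared inside the parallel loop so that each thread gets its own copies.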
2. The same single file compiled with ICC, with the additional optimization options /Qipo and /Qparallel
Around 2600 ms.
Multithreading is still not measured; I will do it later.
3. CUDA is not tested yet.
4. MKL is not measured. That is a bit unfair to this CPU. Well, this was a spur-of-the-moment thing; I must fill it in later.
5. The funny part: on a whim I changed the code that accumulates into matrixC to use a local variable instead, just to see whether it would make any difference. It really does: on average about 100 ms less.
It seems that what the experts say about cache hits is quite reasonable.
The code above is the changed version. Before the change it was:
void MulMatrix(float *matrixA, float *matrixB, float *matrixC)
{
    register int i, j, k, t;
    register float *p, *q;
    for (j = 0; j < M; j++)
    {
        for (i = 0; i < M; i++)
        {
            p = matrixA + j * M;
            q = matrixB + i;
            t = j * M + i;
            matrixC[t] = 0;
            for (k = 0; k < M; k++)
            {
                matrixC[t] += *p * *q;
                p++;
                q += M;
            }
        }
    }
}
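One plausible contributing factor, which is an assumption and not something measured here: when the sum is written through the output pointer on every inner iteration, the compiler usually cannot prove that the output does not overlap the inputs, so it has to store the value each time. A local variable can stay in a register and be stored once. The sketch below isolates just that difference on a dot product; the function names are hypothetical.

```c
/* Hypothetical names, for illustration only. */

/* Accumulate directly into *out. Because out could overlap x or y, the
   compiler generally has to write *out back on every iteration. */
void dot_store_each_step(const float *x, const float *y, int n, float *out)
{
    int k;
    *out = 0.0f;
    for (k = 0; k < n; k++)
    {
        *out += x[k] * y[k];
    }
}

/* Accumulate in a local variable and store the result once at the end. */
void dot_local_accumulator(const float *x, const float *y, int n, float *out)
{
    int k;
    float f = 0.0f;
    for (k = 0; k < n; k++)
    {
        f += x[k] * y[k];
    }
    *out = f;
}
```

Both functions add the products in the same order, so they are meant to give the same answer; only the number of stores differs.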
6. Even funnier: in q += M; I changed M to 100 ..... and the time dropped to 1/10 of the original.
Is that the cache as well?
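The cache is a likely suspect. With M = 1024, q += M jumps 1024 * 4 = 4096 bytes on every step, so each read of matrixB lands on a different cache line. With 100 the jump is only 400 bytes, but the result is then no longer a correct matrix product. A minimal sketch of a way to get small steps while keeping the result correct is to reorder the loops so that the innermost loop walks along rows of both B and C. This is a standard loop interchange, not something from the original code and not timed here; the name mul_matrix_ikj and the parameter n are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original post: matrix product with the
   loops reordered so the innermost loop reads and writes consecutive floats. */
void mul_matrix_ikj(const float *a, const float *b, float *c, int n)
{
    int i, j, k;
    size_t idx;
    size_t total = (size_t)n * (size_t)n;

    /* c is accumulated into, so it has to start at zero. */
    for (idx = 0; idx < total; idx++)
    {
        c[idx] = 0.0f;
    }

    for (j = 0; j < n; j++)
    {
        float *crow = c + (size_t)j * n;            /* row j of c */
        for (k = 0; k < n; k++)
        {
            float ajk = a[(size_t)j * n + k];       /* one element of a */
            const float *brow = b + (size_t)k * n;  /* row k of b */
            for (i = 0; i < n; i++)
            {
                crow[i] += ajk * brow[i];           /* step of 4 bytes */
            }
        }
    }
}
```

Each element of c still receives the same products in the same order of k as in MulMatrix; only the order in which memory is visited changes.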
Performance of Intel Compiler