Program Optimization Step by Step (3): OpenHMPP Directives (More Flexible Use of Heterogeneous Computing)


1. A brief introduction to HMPP

HMPP stands for Hybrid Multicore Parallel Programming. It is a standard for heterogeneous computing initiated by CAPS entreprise (Chinese site: www.caps-entreprise.com.cn), and it can greatly reduce the time we spend on program optimization. You can refer to my earlier articles on HMPP for how to obtain an HMPP trial version.

HMPP is a standard based on compiler directives (similar to OpenMP). The difference is that OpenMP is a parallel programming standard for CPUs, while HMPP is a standard for heterogeneous platforms (for example CPU+GPU or CPU+MIC). It supports both C and Fortran.

In addition, the HMPP compiler can generate CUDA code from your #pragma directives, and it can also compile CUDA code directly.
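To give a feel for the directive style before the full example below, here is a minimal sketch. The label mylab and the scale function are invented for illustration; the codelet/callsite pair is the same mechanism used later in this article:

// Mark a pure function as a codelet that HMPP may offload to CUDA.
#pragma hmpp mylab codelet, target=CUDA, args[*].transfer=atcall
void scale(int n, float alpha, float v[n])
{
    int i;
    for (i = 0; i < n; i++)
        v[i] = alpha * v[i];
}

// At the call site, ask HMPP to run the codelet on the accelerator;
// if no accelerator is available, HMPP falls back to the native CPU code.
#pragma hmpp mylab callsite
scale(n, 2.0f, v);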

In short, the HMPP compiler is very powerful.

2. A recommended principle for using HMPP and OpenACC

The point of using HMPP is to gain a speedup of tens or even thousands of times by adding a small number of #pragma directives, while changing the original code as little as possible. The precondition, of course, is that your original code already executes correctly according to the designed algorithm.


3. Continuing to optimize the matrix multiplication code

(1) First, the code to be optimized (note that this code comes from CAPS; it is their original code, and I have made no substantive changes):

/*
 * Copyright 2008-2012 CAPS entreprise. All rights reserved.
 */
#include <getopt.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// Number of executions
#define NB_RUNS 5
// Size of the matrices
#define SIZE 256
// Seed used to initialize the random generator
#define SRAND_VALUE 5347

// Used to initialize the matrices
float randFloat(float low, float high)
{
    float t = (float)rand() / (float)RAND_MAX;
    return (1.0f - t) * low + t * high;
}

////////////////////////////////////////////////////////////////////////////////
// sgemm_codelet
////////////////////////////////////////////////////////////////////////////////
void mySgemm(int m, int n, int k, float alpha, float beta,
             float a[m][n], float b[n][k], float c[m][k])
{
    int i, j, l; // induction variables
    float ab;    // temporary result

    for (j = 0; j < m; j++) {
        for (i = 0; i < k; i++) {
            ab = 0.0f;
            for (l = 0; l < n; l++) {
                ab += a[j][l] * b[l][i];
            }
            c[j][i] = alpha * ab + beta * c[j][i];
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    int m = SIZE, n = SIZE, k = SIZE;
    float *my_a = NULL, *my_b = NULL, *c_hwa = NULL, *c_cpu = NULL;
    int i, j, ii;

    // For timer measures
    struct timeval tv_global_begin, tv_global_end; // global timer (all iterations)
    struct timeval tv_begin, tv_end;               // local timer (1 iteration)

    unsigned long long int best_measure_gpu = 0;
    unsigned long long int sum_measure_gpu = 0;
    unsigned long long int best_measure_cpu = 0;
    unsigned long long int sum_measure_cpu = 0;
    unsigned long long int global_cpu_time = 0;
    unsigned long long int global_gpu_time = 0;
    unsigned long long int current;

    float alpha, beta;

    double error = 0.0;
    int index_i = 0;
    int index_j = 0;
    double valuecpu = 0.0;
    double valuegpu = 0.0;

    // Allocating CPU memory
    my_a  = (float *)malloc(m * n * sizeof(float));
    my_b  = (float *)malloc(n * k * sizeof(float));
    c_hwa = (float *)malloc(m * k * sizeof(float));
    c_cpu = (float *)malloc(m * k * sizeof(float));

    if ((my_a == NULL) || (my_b == NULL) || (c_hwa == NULL) || (c_cpu == NULL)) {
        fprintf(stderr, "\n**** error: memory allocation failed ****\n\n");
        return 1;
    }

    fprintf(stdout, "---- Initialization of the matrices ----\n\n");
    srand(SRAND_VALUE);

    // Generate the input matrices and the two copies of C
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            my_a[i*n+j] = randFloat(0.0001f, 1.0f);
        }
    }
    for (i = 0; i < n; i++) {
        for (j = 0; j < k; j++) {
            my_b[i*k+j] = randFloat(0.0001f, 1.0f);
        }
    }
    for (i = 0; i < m; i++) {
        for (j = 0; j < k; j++) {
            c_cpu[i*k+j] = randFloat(1.0f, 20.0f);
            c_hwa[i*k+j] = c_cpu[i*k+j];
        }
    }

    alpha = 0.5;
    beta  = randFloat(1.0f, 2.0f);

    fprintf(stdout, "---- Running calculations ----\n");

    // Run sgemm on GPU (NB_RUNS iterations)
    printf("Run on GPU\n");

    // Start timer
    gettimeofday(&tv_global_begin, NULL);

    for (i = 0; i < NB_RUNS; i++) {
        printf("%d ", i);
        gettimeofday(&tv_begin, NULL);

        mySgemm(m, n, k, alpha, beta, my_a, my_b, c_hwa);

        gettimeofday(&tv_end, NULL);
        current = (tv_end.tv_sec - tv_begin.tv_sec) * 1e6
                + tv_end.tv_usec - tv_begin.tv_usec;
        if ((best_measure_gpu == 0) || (best_measure_gpu > current)) {
            best_measure_gpu = current;
        }
        sum_measure_gpu += current;
    }

    gettimeofday(&tv_global_end, NULL);
    global_gpu_time = (tv_global_end.tv_sec - tv_global_begin.tv_sec) * 1e6
                    + tv_global_end.tv_usec - tv_global_begin.tv_usec;

    // Run sgemm on CPU (NB_RUNS iterations)
    printf("\n\nRun on CPU\n");

    // Start timer
    gettimeofday(&tv_global_begin, NULL);

    for (i = 0; i < NB_RUNS; i++) {
        printf("%d ", i);
        gettimeofday(&tv_begin, NULL);

        mySgemm(m, n, k, alpha, beta, my_a, my_b, c_cpu);

        gettimeofday(&tv_end, NULL);
        current = (tv_end.tv_sec - tv_begin.tv_sec) * 1e6
                + tv_end.tv_usec - tv_begin.tv_usec;
        if ((best_measure_cpu == 0) || (best_measure_cpu > current)) {
            best_measure_cpu = current;
        }
        sum_measure_cpu += current;
    }

    gettimeofday(&tv_global_end, NULL);
    global_cpu_time = (tv_global_end.tv_sec - tv_global_begin.tv_sec) * 1e6
                    + tv_global_end.tv_usec - tv_global_begin.tv_usec;

    // Compute the error between GPU and CPU results
    for (ii = 0; ii < m; ii++) {
        for (j = 0; j < k; j++) {
            double lerror = fabs((c_hwa[ii*k+j] - c_cpu[ii*k+j]) / c_cpu[ii*k+j]);
            if (lerror > error) {
                error = lerror;
                valuecpu = c_cpu[ii*k+j];
                valuegpu = c_hwa[ii*k+j];
                index_i = ii;
                index_j = j;
            }
        }
    }

    if (error > 2e-06) {
        fprintf(stdout, "\n\nThe error is %e with index %d:%d @ %e (CPU) / %e (GPU)\n",
                error, index_i, index_j, valuecpu, valuegpu);
        fprintf(stdout, "The error is too big!\n");
        return -1;
    }

    fprintf(stdout, "\n\n---- Results ----");
    fprintf(stdout, "\n");
    fprintf(stdout, "Sizes of matrices: M:%i  N:%i  K:%i\n\n", m, n, k);
    fprintf(stdout, "Best HWA time    : %f ms\n", best_measure_gpu / 1e3);
    fprintf(stdout, "Mean HWA time    : %f ms\n", sum_measure_gpu / NB_RUNS / 1e3);
    fprintf(stdout, "\n");
    fprintf(stdout, "Best CPU time    : %f ms\n", best_measure_cpu / 1e3);
    fprintf(stdout, "Mean CPU time    : %f ms\n", sum_measure_cpu / NB_RUNS / 1e3);
    fprintf(stdout, "\n");
    fprintf(stdout, "Global HWA time  : %f ms\n", global_gpu_time / 1e3);
    fprintf(stdout, "Global CPU time  : %f ms\n", global_cpu_time / 1e3);
    fprintf(stdout, "\n");
    fprintf(stdout, "Speed-up         : %f (computed on the best time)",
            ((float)best_measure_cpu) / best_measure_gpu);
    fprintf(stdout, "\n");

    free(my_a);
    free(my_b);
    free(c_hwa);
    free(c_cpu);

    return 0;
}

Note that the code above runs the same function twice and compares the two sets of results. Next we add two simple directives, compile and execute again, and look at the speedup.

Insert this directive at lines 31-32 of the original source file, immediately before the definition of mySgemm:

#pragma hmpp mylab codelet, target=CUDA, args[*].transfer=atcall

Insert this at line 138 of the original source file, immediately before the call to mySgemm in the GPU timing loop:

#pragma hmpp mylab callsite
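In context, the two directives sit right next to the code they affect (args[*].transfer=atcall tells HMPP to transfer every argument at each call):

#pragma hmpp mylab codelet, target=CUDA, args[*].transfer=atcall
void mySgemm(int m, int n, int k, float alpha, float beta,
             float a[m][n], float b[n][k], float c[m][k])
{
    ...
}

    ...
    // GPU timing loop in main()
#pragma hmpp mylab callsite
    mySgemm(m, n, k, alpha, beta, my_a, my_b, c_hwa);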

Compile and execute (the hmpp driver wraps the host compiler; the --codelet-required option makes compilation fail if the codelet cannot be generated):

$ hmpp --codelet-required gcc source.c

Execution results:

---- Initialization of the matrices ----

---- Running calculations ----
Run on GPU
0 1 2 3 4

Run on CPU
0 1 2 3 4

---- Results ----
Sizes of matrices: M:256  N:256  K:256

Best HWA time    : 1.436000 ms
Mean HWA time    : 21.837000 ms

Best CPU time    : 86.995000 ms
Mean CPU time    : 87.583000 ms

Global HWA time  : 109.192000 ms
Global CPU time  : 437.922000 ms

Speed-up         : 60.581478 (computed on the best time)

That is a speedup of more than 60x (86.995 ms / 1.436 ms ≈ 60.58, computed on the best times), and all I typed was two lines of directives.

Of course, HMPP is not only this simple: it provides many more directives, and they are not difficult to learn. The point is that we can use GPU computing resources very conveniently without having to learn CUDA or OpenCL directly. You will only appreciate all the benefits once you try it yourself.
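As a small taste of those further directives, the sketch below keeps the matrices resident on the GPU across the NB_RUNS iterations instead of re-transferring them at every call. I am quoting the directive names (allocate, advancedload, delegatedstore, release) from memory of the HMPP Workbench documentation, so treat this as an assumption and verify it against your HMPP version:

// Assumed HMPP data-residency directives -- verify against your HMPP release.
#pragma hmpp mylab allocate                      // reserve device memory for the codelet's arguments
#pragma hmpp mylab advancedload, args[a;b;c]     // upload a, b and c once, before the loop

for (i = 0; i < NB_RUNS; i++) {
#pragma hmpp mylab callsite, args[a;b;c].advancedload=true
    mySgemm(m, n, k, alpha, beta, my_a, my_b, c_hwa);
}

#pragma hmpp mylab delegatedstore, args[c]       // download the result c once, after the loop
#pragma hmpp mylab release                       // free the device memory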

I will explain more directives and some interesting details in later posts on this blog. You are welcome to follow along.







