Installing PyOpenCL on Windows 7 (AMD graphics card)

Source: Internet
Author: User

Supposedly, things are simple:

pip install pyopencl


But it did not succeed: the error said that mako was not installed. Although the message claimed mako was optional, I installed it anyway to save trouble. The errors continued.


It seems that to install PyOpenCL you first have to install OpenCL, so I grabbed the OpenCL SDK from the AMD website (version 2.9.1, picked from the table at a glance). The installation path does not seem to be selectable; it goes into the Program Files (x86) folder. The errors continued.


This time the error mentioned Visual Studio, so I duly installed the VS Community edition... It then said CL/cl.h could not be found. Confusingly, the program it was trying to run is also called cl...


So that is what this file is... cl.exe, a combined compiler + linker? I wrote a HelloWorld to try compiling (the compiler/linker bundled with MASM is also called CL), and the compilation failed.


The errors now complained that some xxx.h or xxx.lib could not be found. Tutorials for this are easy to find: the folders have to be listed in the LIB and INCLUDE environment variables. So I created a new INCLUDE variable and a new LIB variable, and following a tutorial added a bunch of directories from the Windows library locations (mostly from the Microsoft SDK and the Windows Kits, it seems). In the end helloworld.c compiled. cl really is a one-command compile-and-link; like gcc, you don't even have to specify the output file name.


At this point, do you remember the AMD SDK folder (called the AMD APP SDK)? Opening it up, there is also an include directory and a lib directory. Add the include directory to the INCLUDE environment variable directly. Under lib, however, there is one more level, x86 or x86_64; you may have to try both (I still don't understand why my 64-bit machine needed x86...). One way to test is to add #include <CL/cl.h> at the top of helloworld.c; if it still compiles successfully, then...
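The include-path check described above can also be scripted. Below is a hypothetical little helper (not from the original post) that walks the INCLUDE environment variable the way cl.exe does and reports where CL/cl.h was found, if anywhere:

```python
import os

def find_header(header, include_env="INCLUDE"):
    """Return the first path under the INCLUDE env var containing `header`, else None."""
    for root in os.environ.get(include_env, "").split(os.pathsep):
        candidate = os.path.join(root, header)
        if root and os.path.isfile(candidate):
            return candidate
    return None

# Once INCLUDE is set up correctly, this should print a path inside the
# AMD APP SDK's include directory; None means cl.exe would fail the same way.
print(find_header(os.path.join("CL", "cl.h")))
```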

pip install pyopencl


At least for me, the installation then succeeded, without warnings.



Example program (from http://ju.outofmemory.cn/entry/106475, with the Python 2 print statements changed to Python 3):

# example provided by Eilif Muller
from __future__ import division

KERNEL_CODE = """
// Thread block size
#define BLOCK_SIZE %(block_size)d

// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA %(w_a)d // Matrix A width
#define HA %(h_a)d // Matrix A height
#define WB %(w_b)d // Matrix B width
#define HB WA  // Matrix B height
#define WC WB  // Matrix C width
#define HC HA  // Matrix C height

/*
 * Copyright 1993-2009 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */

/* Matrix multiplication: C = A * B.
 * Device code.
 */

#define AS(j, i) As[i + j * BLOCK_SIZE]
#define BS(j, i) Bs[i + j * BLOCK_SIZE]

////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! WA is A's width and WB is B's width
////////////////////////////////////////////////////////////////////////////////
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))
void
matrixMul( __global float* C, __global float* A, __global float* B)
{
    __local float As[BLOCK_SIZE*BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE*BLOCK_SIZE];

    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    // Index of the first sub-matrix of A processed by the block
    int aBegin = WA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + WA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * WB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0.0f;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + WA * ty + tx];
        BS(ty, tx) = B[b + WB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
"""

import pyopencl as cl
from time import time
import numpy

block_size = 16

ctx = cl.create_some_context()

for dev in ctx.devices:
    assert dev.local_mem_size > 0

queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.PROFILING_ENABLE)
#queue = cl.CommandQueue(ctx)

if False:
    a_height = 4096
    #a_height = 1024
    a_width = 2048
    #a_width = 256
    #b_height == a_width
    b_width = a_height
elif False:
    # like PyCUDA
    a_height = 2516
    a_width = 1472
    b_height = a_width
    b_width = 2144
else:
    # CL SDK
    a_width = 50*block_size
    a_height = 100*block_size
    b_width = 50*block_size
    b_height = a_width

c_width = b_width
c_height = a_height

h_a = numpy.random.rand(a_height, a_width).astype(numpy.float32)
h_b = numpy.random.rand(b_height, b_width).astype(numpy.float32)
h_c = numpy.empty((c_height, c_width)).astype(numpy.float32)

kernel_params = {"block_size": block_size,
        "w_a": a_width, "h_a": a_height, "w_b": b_width}

if "NVIDIA" in queue.device.vendor:
    options = "-cl-mad-enable -cl-fast-relaxed-math"
else:
    options = ""
prg = cl.Program(ctx, KERNEL_CODE % kernel_params,
        ).build(options=options)
kernel = prg.matrixMul
#print prg.binaries[0]

assert a_width % block_size == 0
assert a_height % block_size == 0
assert b_width % block_size == 0

# transfer host -> device -----------------------------------------------------
mf = cl.mem_flags

t1 = time()

d_a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_a)
d_b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_b)
d_c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_c.nbytes)

push_time = time()-t1

# warmup ----------------------------------------------------------------------
for i in range(5):
    event = kernel(queue, h_c.shape[::-1], (block_size, block_size),
            d_c_buf, d_a_buf, d_b_buf)
    event.wait()

queue.finish()

# actual benchmark ------------------------------------------------------------
t1 = time()

count = 20
for i in range(count):
    event = kernel(queue, h_c.shape[::-1], (block_size, block_size),
            d_c_buf, d_a_buf, d_b_buf)

event.wait()

gpu_time = (time()-t1)/count

# transfer device -> host -----------------------------------------------------
t1 = time()
cl.enqueue_copy(queue, h_c, d_c_buf)
pull_time = time()-t1

# timing output ---------------------------------------------------------------
gpu_total_time = gpu_time+push_time+pull_time

print("GPU push+compute+pull total [s]:", gpu_total_time)
print("GPU push [s]:", push_time)
print("GPU pull [s]:", pull_time)
print("GPU compute (host-timed) [s]:", gpu_time)
print("GPU compute (event-timed) [s]: ", (event.profile.end-event.profile.start)*1e-9)

gflop = h_c.size * (a_width * 2.) / (1000**3.)
gflops = gflop / gpu_time

print()
print("GFlops/s:", gflops)

# cpu comparison --------------------------------------------------------------
t1 = time()
h_c_cpu = numpy.dot(h_a, h_b)
cpu_time = time()-t1

print()
print("GPU==CPU:", numpy.allclose(h_c, h_c_cpu))
print()
print("CPU time (s)", cpu_time)
print()

print("GPU speedup (with transfer): ", cpu_time/gpu_total_time)
print("GPU speedup (without transfer): ", cpu_time/gpu_time)
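One detail worth noticing in the listing: the %(block_size)d placeholders inside the kernel source are ordinary Python %-formatting, substituted from the kernel_params dict before the OpenCL compiler ever sees the code. A tiny standalone sketch of the mechanism:

```python
# The kernel source is a plain Python string; named %(...)d placeholders
# are filled in from a dict before cl.Program() compiles the result.
template = "#define BLOCK_SIZE %(block_size)d\n#define WA %(w_a)d"
params = {"block_size": 16, "w_a": 800}
print(template % params)
# prints:
# #define BLOCK_SIZE 16
# #define WA 800
```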


Don't worry if the program starts by prompting for input; just look at the options. The number is in square brackets, followed by the description, and you type the number and press Enter. For example, I entered 0 twice, and at the end there was a hint:

Choose platform:
[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7feee7e3168>
Choice [0]: 0
Choose device(s):
[0] <pyopencl.Device 'Capeverde' on 'AMD Accelerated Parallel Processing' at 0x9f28700>
[1] <pyopencl.Device 'Intel(R) Pentium(R) CPU G4560 @ 3.50GHz' on 'AMD Accelerated Parallel Processing' at 0xa627610>
Choice, comma-separated [0]: 0
Set the environment variable PYOPENCL_CTX='0:0' to avoid being asked again.


This means that if you set the environment variable, you don't have to choose every time. Configuring the environment variable system-wide had no effect for me, but adding the following code at the beginning of the file worked (thanks, Stack Overflow):

import os
os.environ['PYOPENCL_CTX'] = '0:0'


Run it again: the prompt above is gone and the results print directly. I don't really understand the code yet, but GPU==CPU presumably means that the CPU and the GPU computed the same result, and numpy prints its arrays with an ellipsis in the middle...
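For anyone puzzling over the numbers the script prints: the GFlops/s figure comes from simple FLOP accounting. Each element of C needs a_width multiplies and a_width adds, so for the "CL SDK" sizes in the script the count works out as follows (a pure-Python sketch, no GPU needed):

```python
# FLOP count for one kernel launch with the "CL SDK" sizes from the script.
block_size = 16
a_width = 50 * block_size     # 800
a_height = 100 * block_size   # 1600
b_width = 50 * block_size     # 800

c_size = a_height * b_width          # number of output elements in C
flop = c_size * (a_width * 2.0)      # one multiply + one add per k step
gflop = flop / 1000**3
print(gflop)  # 2.048 GFLOP per launch; dividing by gpu_time gives GFlops/s
```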


(2018-2-2 on Earth)
