Supposedly, things should still be simple:
pip install pyopencl
But it did not succeed. The error indicated that mako was not installed; the message also said it doesn't matter if it isn't, but I figured installing it couldn't hurt. The error continued anyway.
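As an aside, a quick way to check whether an optional dependency like mako is actually importable before retrying the install (this is my own sketch, not part of the original steps):

```python
import importlib.util

# Check whether the optional 'mako' dependency is importable;
# if this prints False, `pip install mako` first and retry.
print(importlib.util.find_spec("mako") is not None)
```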
It seems that to install PyOpenCL you have to install OpenCL first, so I got the OpenCL SDK from the AMD website (version 2.9.1 from the version table, picked at a glance). The installer does not seem to offer a choice of path; it goes into the Program Files (x86) folder. Still erroring.
This time the error mentioned VS; I should be glad I had already installed the VS Community edition... It said CL/cl.h could not be found. Confusingly, the program being invoked is also called cl...
So that's the file... cl.exe — is CL short for compile + link? I wrote a HelloWorld to try compiling it (the compiler/linker that ships with MASM is also called CL), and the compilation failed.
The message says some xxx.h or xxx.lib cannot be found. Tutorials for this are easy to find: the relevant folders have to be listed in the LIB and INCLUDE environment variables. So I created a new INCLUDE variable and a LIB variable, and following a tutorial added a bunch of Windows library directories (mostly from the Microsoft SDK and the Windows Kits, it seems). In the end helloworld.c compiled. CL really does compile and link in one step, like gcc, without having to spell out the file names.
At this point, remember the AMD SDK folder (called the AMD APP SDK)? Open it and there is also an include directory and a lib directory. Add the include directory to the INCLUDE environment variable directly. The lib directory has one more level underneath, x86 or x86_64 — try both (I still don't understand why my 64-bit machine needed x86...). One way to test: add #include <CL/cl.h> at the top of helloworld.c; if it still compiles successfully, you're set.
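As a quick sanity check (my own sketch, not from the original post), a few lines of Python can confirm whether a header such as CL/cl.h is reachable through the INCLUDE environment variable that cl.exe consults:

```python
import os

def find_on_path(filename, env_var):
    """Return the directories listed in env_var that contain filename."""
    hits = []
    for d in os.environ.get(env_var, "").split(os.pathsep):
        if d and os.path.isfile(os.path.join(d, filename)):
            hits.append(d)
    return hits

# e.g. check that <CL/cl.h> can be resolved via INCLUDE
print(find_on_path(os.path.join("CL", "cl.h"), "INCLUDE"))
```

If this prints an empty list, cl.exe will not find the header either.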
pip install pyopencl
At least for me, the installation then succeeded, without warnings.
Example program (from http://ju.outofmemory.cn/entry/106475; I changed the Python 2 prints to Python 3):
# example provided by Eilif Muller

from __future__ import division

kernel_code = """
// Thread block size
#define BLOCK_SIZE %(block_size)d

// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA %(w_a)d // Matrix A width
#define HA %(h_a)d // Matrix A height
#define WB %(w_b)d // Matrix B width
#define HB WA  // Matrix B height
#define WC WB  // Matrix C width
#define HC HA  // Matrix C height

/*
 * Copyright 1993-2009 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */

/* Matrix multiplication: C = A * B.
 * Device code.
 */

#define AS(j, i) As[i + j * BLOCK_SIZE]
#define BS(j, i) Bs[i + j * BLOCK_SIZE]

////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! WA is A's width and WB is B's width
////////////////////////////////////////////////////////////////////////////////
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))
void
matrixMul( __global float* C, __global float* A, __global float* B)
{
    __local float As[BLOCK_SIZE*BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE*BLOCK_SIZE];

    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    // Index of the first sub-matrix of A processed by the block
    int aBegin = WA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + WA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * WB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0.0f;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + WA * ty + tx];
        BS(ty, tx) = B[b + WB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
"""

import pyopencl as cl
from time import time
import numpy

block_size = 16

ctx = cl.create_some_context()

for dev in ctx.devices:
    assert dev.local_mem_size > 0

queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.PROFILING_ENABLE)
#queue = cl.CommandQueue(ctx)

if False:
    a_height = 4096
    #a_height = 1024
    a_width = 2048
    #a_width = 256
    #b_height == a_width
    b_width = a_height
elif False:
    # like PyCUDA
    a_height = 2516
    a_width = 1472
    b_height = a_width
    b_width = 2144
else:
    # CL SDK
    a_width = 50*block_size
    a_height = 100*block_size
    b_width = 50*block_size
    b_height = a_width

c_width = b_width
c_height = a_height

h_a = numpy.random.rand(a_height, a_width).astype(numpy.float32)
h_b = numpy.random.rand(b_height, b_width).astype(numpy.float32)
h_c = numpy.empty((c_height, c_width)).astype(numpy.float32)

kernel_params = {"block_size": block_size,
        "w_a": a_width, "h_a": a_height, "w_b": b_width}

if "NVIDIA" in queue.device.vendor:
    options = "-cl-mad-enable -cl-fast-relaxed-math"
else:
    options = ""
prg = cl.Program(ctx, kernel_code % kernel_params,
        ).build(options=options)
kernel = prg.matrixMul
#print prg.binaries[0]

assert a_width % block_size == 0
assert a_height % block_size == 0
assert b_width % block_size == 0

# transfer host -> device -----------------------------------------------------
mf = cl.mem_flags

t1 = time()

d_a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_a)
d_b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_b)
d_c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_c.nbytes)

push_time = time()-t1

# warmup ----------------------------------------------------------------------
for i in range(5):
    event = kernel(queue, h_c.shape[::-1], (block_size, block_size),
            d_c_buf, d_a_buf, d_b_buf)
    event.wait()

queue.finish()

# actual benchmark ------------------------------------------------------------
t1 = time()

count = 20
for i in range(count):
    event = kernel(queue, h_c.shape[::-1], (block_size, block_size),
            d_c_buf, d_a_buf, d_b_buf)

event.wait()

gpu_time = (time()-t1)/count

# transfer device -> host -----------------------------------------------------
t1 = time()
cl.enqueue_copy(queue, h_c, d_c_buf)
pull_time = time()-t1

# timing output ---------------------------------------------------------------
gpu_total_time = gpu_time+push_time+pull_time

print("GPU push+compute+pull total [s]:", gpu_total_time)
print("GPU push [s]:", push_time)
print("GPU pull [s]:", pull_time)
print("GPU compute (host-timed) [s]:", gpu_time)
print("GPU compute (event-timed) [s]: ",
        (event.profile.end-event.profile.start)*1e-9)

gflop = h_c.size * (a_width * 2.) / (1000**3.)
gflops = gflop / gpu_time

print()
print("GFlops/s:", gflops)

# cpu comparison --------------------------------------------------------------
t1 = time()
h_c_cpu = numpy.dot(h_a, h_b)
cpu_time = time()-t1

print()
print("GPU==CPU:", numpy.allclose(h_c, h_c_cpu))
print()
print("CPU time (s)", cpu_time)
print()

print("GPU speedup (with transfer): ", cpu_time/gpu_total_time)
print("GPU speedup (without transfer): ", cpu_time/gpu_time)
Don't worry if, on running it, you see a prompt asking for input; just look at the options. The number in square brackets labels each choice, and you type the number and press Enter. For example, I entered 0 twice, and at the end there was a hint:
Choose platform:
[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7feee7e3168>
Choice [0]:0
Choose device(s):
[0] <pyopencl.Device 'Capeverde' on 'AMD Accelerated Parallel Processing' at 0x9f28700>
[1] <pyopencl.Device 'Intel(R) Pentium(R) CPU G4560 @ 3.50GHz' on 'AMD Accelerated Parallel Processing' at 0xa627610>
Choice, comma-separated [0]:0
Set the environment variable PYOPENCL_CTX='0:0' to avoid being asked again.
This means that if you set that environment variable, you don't have to choose each time. Setting the environment variable system-wide had no effect for me, but adding the following code at the beginning of the file worked (thanks, Stack Overflow):
import os
os.environ['PYOPENCL_CTX'] = '0:0'
Run again and the prompt paragraph is gone; the results print directly. I don't fully understand the code yet, but GPU==CPU presumably means the CPU and the GPU computed the same results, and there are numpy-printed arrays with an ellipsis in the middle...
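On that GPU==CPU line: numpy.allclose compares two arrays elementwise within a tolerance rather than bitwise, which is the right check here, since the GPU and the CPU accumulate the float32 sums in different orders and so round differently in the last bits. A small sketch (my own, not from the example) of the difference:

```python
import numpy

a = numpy.array([1.0, 2.0, 3.0], dtype=numpy.float32)
b = a + 1e-6  # simulate tiny GPU-vs-CPU rounding differences

print((a == b).all())        # False: not bitwise equal
print(numpy.allclose(a, b))  # True: equal within default rtol=1e-5, atol=1e-8
```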
(2018-2-2 on Earth)
Win7 (AMD graphics card): installing PyOpenCL