CUDA Programming Practice: cuBLAS

Last Update:2018-07-26 Source: Internet

Author: User



Some applications need to run linear solvers, nonlinear optimization, matrix analysis, and other linear algebra on the GPU. For this purpose the CUDA toolkit provides a BLAS linear algebra library, cuBLAS.

BLAS (Basic Linear Algebra Subprograms) specifies a set of low-level routines for common linear algebra operations such as vector addition, scalar multiplication, dot products, linear transformations, and matrix multiplication. BLAS is the de facto standard low-level interface for linear algebra libraries, with bindings for both C and Fortran. Because the interface is standardized, code written against it is portable, while each BLAS implementation is optimized for a particular machine, typically by exploiting special hardware such as vector registers and SIMD instructions. Using BLAS can therefore bring significant performance improvements.
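To make the BLAS calling convention concrete, here is a minimal plain-C sketch of two Level-1 operations, "axpy" (y := alpha * x + y) and the dot product. It follows the reference BLAS argument order (n, alpha, x, incx, y, incy), where incx and incy are strides between consecutive vector elements. The names ref_saxpy and ref_sdot are hypothetical illustrations, not cuBLAS functions, and for brevity only positive increments are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for Level-1 "axpy": y := alpha * x + y.
 * Reads every incx-th element of x and every incy-th element of y.
 * Only positive increments are supported in this sketch. */
void ref_saxpy(int n, float alpha, const float *x, int incx,
               float *y, int incy)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i)
        y[(size_t)i * incy] += alpha * x[(size_t)i * incx];
}

/* Hypothetical CPU reference for the Level-1 dot product:
 * returns the sum of x[i] * y[i] over n strided elements. */
float ref_sdot(int n, const float *x, int incx, const float *y, int incy)
{
    assert(incx > 0 && incy > 0);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[(size_t)i * incx] * y[(size_t)i * incy];
    return sum;
}
```

cuBLAS exposes the same operations as cublasSaxpy() and cublasSdot(), which additionally take a cuBLAS handle as the first argument and expect x and y to be in GPU memory.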

Description from the official website

The cuBLAS library is an implementation of BLAS on top of the NVIDIA CUDA runtime that gives users access to the computing resources of NVIDIA GPUs. Starting with CUDA 6.0, cuBLAS exposes two sets of APIs: the regular cuBLAS API, referred to simply as the cuBLAS API, and the cuBLASXt API. To use the cuBLAS API, the application allocates the GPU memory required by the matrices and vectors, fills it with data, calls the desired sequence of cuBLAS functions, and then copies the results from GPU memory back to the host. The cuBLAS API also provides helper functions for writing data to and reading data from the GPU.
To use the cuBLASXt API, the application keeps the data on the host, and the library dispatches the operation to one or more GPUs in the system, depending on the user's request.

cublasStatus_t CUBLASWINAPI cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);

The function copies a tile of rows x cols elements from matrix A in host (CPU) memory to matrix B in GPU memory. Each element occupies elemSize bytes.
Both matrices are assumed to be stored in column-major format, with the leading dimension (i.e., the number of rows) of the source matrix A given in lda and the leading dimension of the destination matrix B given in ldb. In general, B is a device pointer to an object, or part of an object, allocated in GPU memory via cublasAlloc().
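The role of lda and ldb is easiest to see in code. The sketch below reproduces, entirely in host memory, the kind of strided copy that cublasSetMatrix() describes: because column-major storage keeps each column contiguous, a rows x cols tile can be copied one column at a time, stepping lda elements through the source and ldb elements through the destination. The function name sketch_set_matrix is hypothetical; the real cublasSetMatrix() writes B in GPU memory and reports errors through cublasStatus_t rather than assert().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only model of the cublasSetMatrix() copy:
 * copy a rows x cols tile from column-major A (leading dimension lda)
 * into column-major B (leading dimension ldb). Each element is
 * elemSize bytes. Both buffers are in host memory in this sketch. */
void sketch_set_matrix(int rows, int cols, int elemSize,
                       const void *A, int lda, void *B, int ldb)
{
    assert(rows >= 0 && cols >= 0 && elemSize > 0);
    assert(lda >= rows && ldb >= rows);
    const unsigned char *src = (const unsigned char *)A;
    unsigned char *dst = (unsigned char *)B;
    /* Column j starts at offset j * ld elements; copy rows elements of it. */
    for (int j = 0; j < cols; ++j)
        memcpy(dst + (size_t)j * ldb * elemSize,
               src + (size_t)j * lda * elemSize,
               (size_t)rows * elemSize);
}
```

For example, with a 3 x 2 column-major source (lda = 3), copying the top 2 x 2 tile into a destination with ldb = 2 skips the third element of each source column. In a real program the equivalent call is cublasSetMatrix(rows, cols, sizeof(float), host_ptr, lda, dev_ptr, ldb), where dev_ptr is a device allocation; cublasGetMatrix() performs the reverse copy.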

cublasHandle_t handle;  /* requires #include <cublas_v2.h>, <stdio.h>, <stdlib.h> */
cublasStatus_t stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS) { printf("CUBLAS initialization failed\n"); return EXIT_FAILURE; }

First create a handle to the cuBLAS library context with cublasCreate(), which initializes the library. The handle must be passed explicitly to every subsequent cuBLAS API call, and released with cublasDestroy() when it is no longer needed. Passing the handle explicitly gives the user finer control over the library setup when using multiple host threads and multiple GPUs.


The arrangement of elements in matrices

C stores matrices in row-major order: in memory, the first row of matrix A is stored contiguously, the second row follows immediately after it, and so on. We call an array holding A in this layout host_a.

So how does cuBLAS interpret host_a?

cuBLAS stores matrices in column-major order, so it interprets data in memory as column-major. If we tell cuBLAS that host_a holds an M x N matrix, it takes the first M contiguous elements of the array as the first column of the matrix. This is obviously wrong, because in C the first N contiguous elements of host_a are the first row of A. In fact, a row-major M x N array read in column-major order is the N x M transpose of A.

Compiling and linking
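The layout mismatch above can be demonstrated on the CPU. The sketch below uses the IDX2C(i, j, ld) indexing macro that appears in NVIDIA's cuBLAS examples, plus a naive column-major matrix multiply (colmajor_gemm, a hypothetical stand-in for a cuBLAS GEMM call). Since every row-major buffer looks like its own transpose to a column-major routine, the row-major product C = A * B can be obtained from the identity C^T = B^T * A^T by passing B first, A second, and swapping m and n; rowmajor_gemm_via_colmajor is likewise a hypothetical name for this wrapper.

```c
#include <assert.h>

/* Column-major index of element (i, j) with leading dimension ld. */
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* Naive column-major GEMM sketch (no transposes):
 * C (m x n) = A (m x k) * B (k x n), all stored column-major. */
static void colmajor_gemm(int m, int n, int k,
                          const float *A, int lda,
                          const float *B, int ldb,
                          float *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[IDX2C(i, p, lda)] * B[IDX2C(p, j, ldb)];
            C[IDX2C(i, j, ldc)] = sum;
        }
}

/* Row-major C (m x n) = A (m x k) * B (k x n) using the column-major
 * routine. Read column-major, the buffers hold A^T (k x m, ld k),
 * B^T (n x k, ld n) and C^T (n x m, ld n); C^T = B^T * A^T. */
void rowmajor_gemm_via_colmajor(int m, int n, int k,
                                const float *A, const float *B, float *C)
{
    colmajor_gemm(n, m, k, B, n, A, k, C, n);
}
```

With cuBLAS the same operand swap is a common way to multiply row-major data without explicit transposes, e.g. cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, d_B, n, d_A, k, &beta, d_C, n), where d_A, d_B, d_C are assumed device copies of the row-major arrays.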

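Source files that call cuBLAS functions must include either the header "cublas_v2.h" (current API) or the legacy "cublas.h", and the application must be linked against the cuBLAS library: cublas.so on Linux or the DLL cublas.dll on Windows (for example, nvcc app.cu -lcublas).

Common data types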

cublasDataType_t value	Meaning
CUBLAS_DATA_FLOAT	The data type is 32-bit floating point
CUBLAS_DATA_DOUBLE	The data type is 64-bit floating point
CUBLAS_DATA_HALF	The data type is 16-bit floating point
CUBLAS_DATA_INT8	The data type is 8-bit signed integer

Common Functions

Matrix-vector functions (<t> stands for a data-type prefix such as S for float or D for double)

Function	Operation
cublas<t>gbmv()	Banded matrix-vector multiplication: $y = \alpha\,\mathrm{op}(A)\,x + \beta y$
cublas<t>gemv()	General matrix-vector multiplication: $y = \alpha\,\mathrm{op}(A)\,x + \beta y$
cublas<t>syr2()	Symmetric rank-2 update: $A = \alpha\,(x y^T + y x^T) + A$
cublas<t>tbmv()	Triangular banded matrix-vector multiplication: $x = \mathrm{op}(A)\,x$
cublas<t>tbsv()	Solves a triangular banded linear system: $\mathrm{op}(A)\,x = b$
cublas<t>tpmv()	Triangular packed matrix-vector multiplication: $x = \mathrm{op}(A)\,x$
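To show what $y = \alpha\,\mathrm{op}(A)\,x + \beta y$ means for gemv, here is a minimal CPU reference, assuming A is an m x n column-major matrix with leading dimension lda and unit vector strides (incx = incy = 1, omitted for brevity). The enum sketch_op and the function ref_sgemv are hypothetical; sketch_op only plays the role that cublasOperation_t (CUBLAS_OP_N, CUBLAS_OP_T) plays in the real API. Following the BLAS convention, y is not read when beta is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op(A) selector: OP_N means op(A) = A, OP_T means A^T. */
typedef enum { OP_N, OP_T } sketch_op;

/* CPU reference for gemv: y = alpha * op(A) * x + beta * y.
 * A is m x n, column-major, leading dimension lda.
 * OP_N: x has n elements, y has m. OP_T: x has m elements, y has n. */
void ref_sgemv(sketch_op trans, int m, int n, float alpha,
               const float *A, int lda, const float *x,
               float beta, float *y)
{
    assert(lda >= m);
    int rows_out = (trans == OP_N) ? m : n;
    for (int r = 0; r < rows_out; ++r) {
        float sum = 0.0f;
        if (trans == OP_N)      /* row r of A dotted with x */
            for (int j = 0; j < n; ++j)
                sum += A[(size_t)j * lda + r] * x[j];
        else                    /* column r of A dotted with x */
            for (int i = 0; i < m; ++i)
                sum += A[(size_t)r * lda + i] * x[i];
        /* When beta == 0, y is treated as output-only. */
        y[r] = alpha * sum + (beta == 0.0f ? 0.0f : beta * y[r]);
    }
}
```

The corresponding cuBLAS call for single precision is cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, lda, d_x, 1, &beta, d_y, 1), with d_A, d_x, d_y assumed to be device pointers.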

The cuBLASXt API provides a host interface that supports multiple GPUs: when using it, the application only needs to allocate the required matrices in host memory. There is no restriction on matrix size as long as the matrices fit in host memory. The cuBLASXt API supports only the compute-intensive BLAS3 routines (matrix-matrix operations).

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
