In some applications we need to implement functionality such as linear solvers, nonlinear optimization, matrix analysis, and general linear algebra on the GPU. The CUDA toolkit provides a BLAS linear algebra library, cuBLAS.
BLAS specifies a set of low-level routines for common linear algebra operations such as vector addition, scalar multiplication, inner products, linear transformations, and matrix multiplication. BLAS has become the de facto standard low-level interface for linear algebra, with both C and Fortran bindings. Because the BLAS interface is universal, each implementation can be tuned for a particular machine: optimized implementations exploit special hardware features such as vector registers and SIMD instructions, so using BLAS brings real performance improvements.
Official description
The cuBLAS library is a BLAS implementation that lets applications use the computing resources of NVIDIA GPUs. Since CUDA 6.0, cuBLAS has contained two classes of API: the regular cuBLAS API (referred to simply as the cuBLAS API) and the cuBLASXt API. When using the cuBLAS API, the application must allocate the GPU memory required by the matrices and vectors, load them with data, call the desired cuBLAS functions, and then copy the computed results from GPU memory back to the host. The cuBLAS API also provides helper functions for writing data to and reading data from the GPU.
When using the cuBLASXt API, the application keeps its data on the host, and the library dispatches the work to one or more GPUs in the system according to the user's request.
```c
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize,
                               const void *A, int lda, void *B, int ldb);
```
This function copies a rectangular region of rows × cols elements from matrix A in host (CPU) memory to matrix B in GPU memory. Each element occupies elemSize bytes of storage.
Both matrices are assumed to be stored in column-major format, with the leading dimension (i.e., the number of rows) of the source matrix A supplied in lda and the leading dimension of the destination matrix B supplied in ldb. In general, B points to an object, or part of an object, allocated via cublasAlloc().
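The upload/download round trip described above can be sketched as follows. This is a minimal sketch that assumes a CUDA-capable GPU is present and abbreviates error handling; the matrix contents are made up for illustration.

```cuda
// Sketch: copy a host matrix to the GPU with cublasSetMatrix, then read it
// back with cublasGetMatrix. Assumes a CUDA-capable device; error handling
// is abbreviated.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int rows = 3, cols = 2;                  // 3 x 2 matrix, column-major
    float hostA[rows * cols] = {1, 2, 3, 4, 5, 6}; // two columns of three
    float hostB[rows * cols] = {0};

    float *devB = NULL;
    cudaMalloc((void **)&devB, rows * cols * sizeof(float));

    // lda/ldb are the leading dimensions (number of rows) of source and
    // destination; here both matrices are stored densely, so lda = ldb = rows.
    cublasStatus_t stat = cublasSetMatrix(rows, cols, sizeof(float),
                                          hostA, rows, devB, rows);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("data upload failed\n");
        cudaFree(devB);
        return EXIT_FAILURE;
    }

    // Copy the data back from GPU memory to the host.
    cublasGetMatrix(rows, cols, sizeof(float), devB, rows, hostB, rows);
    printf("hostB[0] = %g\n", hostB[0]);

    cudaFree(devB);
    return EXIT_SUCCESS;
}
```

Note that the device buffer here is allocated with cudaMalloc(), which is the usual choice in code written against the cublas_v2.h API.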
```c
cublasHandle_t handle;
cublasStatus_t stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS) {
    printf("CUBLAS initialization failed\n");
    return EXIT_FAILURE;
}
```
First create a handle to the cuBLAS library with cublasCreate(), which initializes the cuBLAS library context. This handle must be passed explicitly to every API function called afterwards. When using multiple host threads and multiple GPUs, managing handles explicitly gives the user more complete control over the library's configuration.
The arrangement of elements in matrices
The C language stores arrays in row-major order: in memory, the first row of matrix A is stored contiguously, then the second row, and so on. Call an array with this storage layout host_a.
So how will cuBLAS interpret host_a?
In cuBLAS, matrices are stored by column, so cuBLAS interprets the data in memory as column-major. If we tell cuBLAS that host_a stores an M × N matrix, cuBLAS will treat the first M contiguous elements of the array as the first column of the matrix. This is obviously not what we intended, because in C the first N contiguous storage units of host_a hold the first row of A.
Compiling and linking
Files that call cuBLAS functions should include the header "cublas.h" (legacy API) or "cublas_v2.h" (current API), and link against the shared library: libcublas.so on Linux or cublas.dll on Windows.
Common data types
| Value | Meaning |
| --- | --- |
| CUBLAS_DATA_FLOAT | The data type is 32-bit floating point |
| CUBLAS_DATA_DOUBLE | The data type is 64-bit floating point |
| CUBLAS_DATA_HALF | The data type is 16-bit floating point |
| CUBLAS_DATA_INT8 | The data type is 8-bit signed integer |
Common Functions
| Function | Operation |
| --- | --- |
| `cublas<t>gbmv()` | banded matrix–vector multiplication: y = α·op(A)·x + β·y |
| `cublas<t>gemv()` | matrix–vector multiplication: y = α·op(A)·x + β·y |
| `cublas<t>syr2()` | symmetric rank-2 update: A = α(xyᵀ + yxᵀ) + A |
| `cublas<t>tbmv()` | triangular banded matrix–vector multiplication: x = op(A)·x |
| `cublas<t>tbsv()` | solves the triangular banded linear system op(A)·x = b |
| `cublas<t>tpmv()` | triangular packed matrix–vector multiplication: x = op(A)·x |
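As an illustration of the workflow with one of these routines, the following sketch computes y = α·A·x + β·y with cublasSgemv. It assumes a CUDA-capable GPU and abbreviates error checking; the matrix values are made up for illustration.

```cuda
// Sketch: y = alpha*A*x + beta*y with cublasSgemv on a 2 x 3 matrix.
// Assumes a CUDA-capable device; error handling is abbreviated.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int m = 2, n = 3;
    // Column-major storage: each column is contiguous.
    float A[m * n] = {1, 4,  2, 5,  3, 6};   // A = [1 2 3; 4 5 6]
    float x[n] = {1, 1, 1};
    float y[m] = {0, 0};
    float alpha = 1.0f, beta = 0.0f;

    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, sizeof(A));
    cudaMalloc((void **)&dx, sizeof(x));
    cudaMalloc((void **)&dy, sizeof(y));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Upload operands (leading dimension of A is m, vector strides are 1).
    cublasSetMatrix(m, n, sizeof(float), A, m, dA, m);
    cublasSetVector(n, sizeof(float), x, 1, dx, 1);
    cublasSetVector(m, sizeof(float), y, 1, dy, 1);

    // CUBLAS_OP_N means op(A) = A (no transpose).
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);

    // Download the result: row sums of A, i.e. [6, 15].
    cublasGetVector(m, sizeof(float), dy, 1, y, 1);
    printf("y = [%g, %g]\n", y[0], y[1]);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```

Note how the array A is written column by column so that cuBLAS sees the intended 2 × 3 matrix.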
The cuBLASXt API provides a host interface that works with multiple GPUs. When these API functions are used, the memory for the matrices is allocated on the host; there is no limit to the size of the matrices as long as they fit in host memory. The cuBLASXt API only supports the compute-intensive BLAS Level 3 operations.
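A minimal sketch of the cuBLASXt workflow, assuming a single CUDA device with id 0: all three matrices stay in ordinary host memory, and the library handles tiling and GPU transfers internally. The matrix values are made up for illustration.

```cuda
// Sketch: C = alpha*A*B + beta*C with cublasXtSgemm (a BLAS Level 3 routine).
// A, B, C live in host memory; assumes one CUDA device with id 0.
#include <cstdio>
#include <cublasXt.h>

int main(void) {
    const int n = 2;
    float A[n * n] = {1, 0, 0, 1};   // identity, column-major
    float B[n * n] = {1, 2, 3, 4};
    float C[n * n] = {0, 0, 0, 0};
    float alpha = 1.0f, beta = 0.0f;

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Tell the library which GPUs to dispatch work to.
    int devices[1] = {0};
    cublasXtDeviceSelect(handle, 1, devices);

    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    // A is the identity, so C should now equal B.
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);

    cublasXtDestroy(handle);
    return 0;
}
```

Compared with the cuBLAS API example above, there are no cudaMalloc or cublasSetMatrix calls: keeping data on the host is exactly what distinguishes cuBLASXt.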