CUDA and CUDA Programming

Source: Internet
Author: User
Tags: nvcc

Introduction to CUDA Libraries


This article briefly introduces the CUDA libraries cuSPARSE, cuBLAS, cuFFT, and cuRAND; OpenACC will be introduced later.

  • The cuSPARSE linear algebra library is mainly used for sparse matrices.
  • cuBLAS is the CUDA standard linear algebra library (BLAS stands for Basic Linear Algebra Subprograms); it has no operations specifically for sparse matrices.
  • cuFFT computes fast Fourier transforms.
  • cuRAND generates random numbers.

Conceptually, a CUDA library is no different from a library used in CPU programming: it is a collection of interfaces. The main advantage is that you only need to write host code and call the corresponding API, which saves a lot of development time. Moreover, we can trust these libraries to deliver good performance; the people who write them are CUDA experts whose skills go beyond most of ours. Of course, relying entirely on these libraries does not mean we can know nothing about CUDA performance optimization; we still need to tune things by hand to find better performance.

A number of supported libraries are mentioned in Professional CUDA C Programming; for details, refer to the NVIDIA Developer Forum.


If your application falls within the scope of one of the libraries above, we recommend using it.

A Common Library Workflow

The following is a typical workflow for using a CUDA library. Of course, each library's usage may differ, but it will not stray far from the steps below; the differences usually amount to a few steps more or fewer.

Below are some detailed explanations of these steps:

Stage 1: Creating a Library Handle

Many CUDA libraries have the concept of a handle, which holds context information for the library, such as the data format in use and the device being used. For a library that uses handles, the first step is to initialize one. Generally, we can think of the handle as an object stored on the host, opaque to the programmer, that contains the information associated with the library. For example, we may want all library operations to run in a particular CUDA stream; although different libraries use different function names, most offer a way to make all library operations happen in a chosen stream (cuSPARSE uses cusparseSetStream, cuBLAS uses cublasSetStream, and cuFFT uses cufftSetStream). The stream information is saved in the handle.
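As a minimal sketch (using cuSPARSE; cuBLAS and cuFFT follow the same pattern with their own function names), creating a handle and binding it to a stream looks like this:

cusparseHandle_t handle;
cudaStream_t stream;

// Initialize the library context
cusparseCreate(&handle);

// Create a stream and direct all subsequent cuSPARSE operations into it
cudaStreamCreate(&stream);
cusparseSetStream(handle, stream);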

Stage 2: Allocating Device Memory

For the libraries covered in this article, device memory is still allocated with cudaMalloc, or the library calls cudaMalloc internally. Custom memory-allocation APIs appear only in libraries designed for multi-GPU programming.

Stage 3: Converting Inputs to a Library-Supported Format

If the data format used by your application differs from the input format the library requires, a conversion is needed. For example, our application may store a row-major 2D array while the library requires column-major; that forces a conversion. For optimal performance we should avoid such conversions wherever possible, that is, keep our data consistent with the library's format.
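As a sketch of what such a conversion involves, here is a hypothetical helper (not a library API) that repacks a row-major M x N host array into column-major order:

// Hypothetical helper: repack a row-major M x N array into column-major order
void rowToColMajor(const float *rowMajor, float *colMajor, int M, int N)
{
    for (int r = 0; r < M; r++)
        for (int c = 0; c < N; c++)
            colMajor[c * M + r] = rowMajor[r * N + c];
}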

Stage 4: Populating Device Memory with Inputs

After the preceding three steps, host data is transferred to the device, much like a cudaMemcpy. Again, most libraries provide their own APIs for this rather than having you call cudaMemcpy directly. For example, when using cuBLAS we transfer a vector to the device with cublasSetVector, which internally calls cudaMemcpy (or an equivalent function) for the transfer.
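For example, a one-line sketch of transferring an N-element host array X to device memory dX with cuBLAS (the stride arguments are explained in the cuBLAS section later in this article):

// Copy N contiguous floats from host X to device dX; both strides are 1
cublasSetVector(N, sizeof(float), X, 1, dX, 1);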

Stage 5: Configuring the Library

As stage 3 shows, data format is clearly an issue: library functions need to know which data format to expect. In some cases, format information such as data dimensions is passed directly as function parameters. In other cases, you configure the library handle mentioned above. And in still other cases, you need to manage separate metadata objects.
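In cuSPARSE, for example, the separate metadata object is the matrix descriptor. A minimal sketch of configuring one (the same calls appear in the demo later in this article):

cusparseMatDescr_t descr;
cusparseCreateMatDescr(&descr);

// Describe the matrix: a general matrix using zero-based indexing
cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);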

Stage 6: Executing

Execution is the easy part: with the previous steps complete, simply pass the configured parameters and call the library API.

Stage 7: Retrieving Results from Device Memory

In this step, the computed result is copied back from the device to the host; again, pay attention to the data format. This step is the inverse of stage 4.

Stage 8: Converting Back to Native Format

If the computed result arrives in a format different from your application's native data format, a conversion is needed. This step is the inverse of stage 3.

Stage 9: Releasing CUDA Resources

When the resources used in the above steps are no longer needed, they should be released. As introduced earlier, allocating and releasing memory is expensive, so resources such as device memory, handles, and CUDA streams should be reused as much as possible.
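A sketch of the cleanup, assuming the handle, stream, descriptor, and device arrays from the examples elsewhere in this article:

// Free device memory
cudaFree(dCsrValA);
cudaFree(dCsrRowPtrA);
cudaFree(dCsrColIndA);

// Destroy library objects and the stream
cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);
cudaStreamDestroy(stream);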

Stage 10: Continuing with the Application

Continue with the rest of the application.

These steps may make using a library look troublesome and inefficient, but in practice many of them are often unnecessary for a given library. The sections below introduce several major libraries and their basic usage; after reading them, I believe you will no longer feel that using these libraries is more trouble than it is worth.

THE CUSPARSE LIBRARY

cuSPARSE is a linear algebra library used extensively for operations on sparse matrices. It supports both dense and sparse data formats.

The figure below shows some of the library's function calls, which give a general sense of its capabilities. cuSPARSE divides functions into levels: all level 1 functions operate only on dense and sparse vectors; all level 2 functions operate on sparse matrices and dense vectors; all level 3 functions operate on sparse and dense matrices.

[Figure: cuSPARSE functions grouped by level]

cuSPARSE Data Storage Formats

A dense matrix is one in which most values are non-zero, and all of its values are stored in a multi-dimensional array. By contrast, most elements of a sparse matrix or vector are zero, so they can be stored in more compact ways, for example by saving only the non-zero values and their coordinates. cuSPARSE supports many sparse-matrix storage formats; this article introduces three of them.

Let's first look at the dense matrix, shown in the following figure:

[Figure: dense matrix storage]

Coordinate (COO)

For each non-zero value in the sparse matrix, COO stores its row and column coordinates. Therefore, when looking up a matrix value by row and column, if the coordinates do not match any stored entry, the value must be zero.

How sparse must a matrix be before COO pays off? That requires case-by-case analysis, mainly involving the element data type and the index data type. For example, for a sparse matrix storing 32-bit floating-point data with 32-bit integer indexes, each non-zero element costs 12 bytes (the value plus its row and column indexes) instead of 4, so storage space is saved only when fewer than one third of the matrix elements are non-zero.

[Figure: COO representation of the example matrix]
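As a concrete illustration (a made-up matrix, since the original figure is unavailable), COO stores one (row, column, value) triple per non-zero element:

/* Hypothetical 3x3 sparse matrix:
 *   | 3 0 0 |
 *   | 0 5 0 |
 *   | 0 7 9 |
 */
float cooVals[] = {3.0f, 5.0f, 7.0f, 9.0f};
int   cooRows[] = {0, 1, 2, 2};
int   cooCols[] = {0, 1, 1, 2};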

Compressed Sparse Row (CSR)

CSR is similar to COO. The only difference is how the row index of each non-zero value is stored. In COO, every non-zero value carries its own integer row index, while CSR stores a single offset per row, shared by all values belonging to that row. As shown below, the row storage shrinks compared with COO:

[Figure: CSR representation of the example matrix]

Because all data belonging to the same row is adjacent in memory, only an offset and a length are needed to find a given row's values. For example, to get the non-zero values of the third row (row 2), search V with offset 2 and length 2, as shown below:

[Figure: locating row 2's values in V with offset 2 and length 2]

Using the same offset and length on C in the figure locates the column indexes, which completely determines each value's position in the matrix. When storing a large matrix in which each row contains many values, CSR is much more effective than storing an explicit row index for every non-zero value.

Now consider how to store these offsets and lengths. The simplest way is to create two arrays, Ro and Rl, each of length nRows; but if the matrix has many rows, that means allocating two large arrays. Instead, we can use a single array R of length nRows + 1: the offset of row i is stored in R[i], and the length of row i is the difference between R[i + 1] and R[i]. In addition, the final entry R[nRows] stores the total number of non-zero values in the matrix. In this example, the R array is as follows:

R = [0, 1, 2, 4]

From R we know the offset of row 0 is 0, the offset of row 1 is 1, the offset of row 2 is 2, and there are 4 non-zero elements in total. We can now recover the values and column indexes of row 0: since R[1] - R[0] = 1 - 0 = 1, row 0 has exactly one non-zero value; its column index is 0 and its value is 3.
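A sketch of how the V, C, and R arrays (as defined above) are traversed in C:

// V: non-zero values, C: column indexes, R: row offsets (length nRows + 1)
for (int i = 0; i < nRows; i++) {
    for (int j = R[i]; j < R[i + 1]; j++) {
        printf("A[%d][%d] = %f\n", i, C[j], V[j]);  // value V[j] at (i, C[j])
    }
}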

In this way, for sparse matrices whose rows each contain several non-zero values, CSR saves more space than COO. The complete CSR representation is shown below:

[Figure: complete CSR storage: the V, C, and R arrays]

Calling functions that take a CSR sparse matrix is intuitive. First, define a CSR-format sparse matrix on the host:

float *h_csrVals;
int *h_csrCols;
int *h_csrRows;

h_csrVals stores the non-zero values, h_csrCols stores their column indexes, and h_csrRows stores the row offsets. Next come the usual steps of allocating device memory and copying the data over:

cudaMalloc((void **)&d_csrVals, n_vals * sizeof(float));
cudaMalloc((void **)&d_csrCols, n_vals * sizeof(int));
cudaMalloc((void **)&d_csrRows, (n_rows + 1) * sizeof(int));

cudaMemcpy(d_csrVals, h_csrVals, n_vals * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_csrCols, h_csrCols, n_vals * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_csrRows, h_csrRows, (n_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);

The three data formats above (including the dense matrix) each have their own strengths. The table below lists some of the data formats supported by cuSPARSE and the best use case for each:

[Table: cuSPARSE data formats and their best use cases]

Formatting Conversion with cuSPARSE

As we saw earlier, format conversion should be avoided as much as possible: it costs not only computation but also extra storage. At the same time, when using cuSPARSE we should take full advantage of its sparse storage formats, since many existing applications simply store everything as dense matrices. Because cuSPARSE supports many data formats, it provides many conversion APIs, listed in the table below; the left column is the conversion target format. An empty cell means direct conversion between those two formats is not supported, but you can still implement the conversion in multiple steps. For example, dense2bsr is not supported directly, but dense2csr followed by csr2bsr achieves the same goal.

[Table: cuSPARSE format-conversion functions]

Demonstrating cuSPARSE

This sample code covers matrix-vector multiplication, data format conversion, and other cuSPARSE features.

// Create the cuSPARSE handle
cusparseCreate(&handle);

// Allocate device memory for vectors and the dense form of the matrix A
...

// Construct a descriptor of the matrix A
cusparseCreateMatDescr(&descr);
cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

// Transfer the input vectors and dense matrix A to the device
...

// Compute the number of non-zero elements in A
cusparseSnnz(handle, CUSPARSE_DIRECTION_ROW, M, N, descr, dA, M,
             dNnzPerRow, &totalNnz);

// Allocate device memory to store the sparse CSR representation of A
...

// Convert A from a dense format to a CSR format, using the GPU
cusparseSdense2csr(handle, M, N, descr, dA, M, dNnzPerRow,
                   dCsrValA, dCsrRowPtrA, dCsrColIndA);

// Perform matrix-vector multiplication with the CSR-formatted matrix A
cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               M, N, totalNnz, &alpha, descr, dCsrValA, dCsrRowPtrA,
               dCsrColIndA, dX, &beta, dY);

// Copy the result vector back to the host
cudaMemcpy(Y, dY, sizeof(float) * M, cudaMemcpyDeviceToHost);

The flow of this code matches the common workflow summarized earlier: create a handle, transfer and describe the data, convert formats, execute, and retrieve the result.

Compile:

$ nvcc -lcusparse cusparse.cu -o cusparse

Important Topics in cuSPARSE Development

Although cuSPARSE offers a relatively fast and concise path to a high-performance linear algebra library, there are a few key points about using cuSPARSE to keep in mind.

The first is to ensure the correct data format for matrices and vectors. cuSPARSE itself has no ability to detect an incorrect or inappropriate data format, and a call with the wrong format may produce a segmentation fault; at least that gives debugging a direction, although segmentation faults come in many varieties. If the matrices and vectors are small, it is feasible to verify the format by hand: perform the inverse conversion and compare the round-tripped data with the original.
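For example, one sketch of such a check is converting the CSR data back to a dense matrix with cusparseScsr2dense and comparing it with the original on the host (dA2 is a hypothetical scratch matrix; the other names are from the demo above):

// Convert the CSR representation back into a dense matrix dA2
cusparseScsr2dense(handle, M, N, descr, dCsrValA, dCsrRowPtrA, dCsrColIndA, dA2, M);
// ...then copy dA2 back to the host and compare it element-wise with the original A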

The second is cuSPARSE's asynchronous behavior by default. This is business as usual for GPU programming, but it can surprise developers used to traditional blocking host-side math libraries. With cuSPARSE, if cudaMemcpy is used to copy results, the host automatically blocks and waits for the device's computation to finish. However, if cuSPARSE has been configured to use CUDA streams and cudaMemcpyAsync, we must watch out and apply correct synchronization before reading the device's results.
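A minimal sketch of the stream-based pattern, assuming the handle was bound to stream with cusparseSetStream as in stage 1:

// Queue the copy of the result on the same stream as the cuSPARSE work...
cudaMemcpyAsync(Y, dY, sizeof(float) * M, cudaMemcpyDeviceToHost, stream);
// ...then block the host until the stream has drained before reading Y
cudaStreamSynchronize(stream);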

The last quirk concerns scalars, which are passed by reference. Consider the beta variable in the following code:

float beta = 4.0f;
...

// Perform matrix-vector multiplication with the CSR-formatted matrix A
cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               M, N, totalNnz, &alpha, descr, dCsrValA, dCsrRowPtrA,
               dCsrColIndA, dX, &beta, dY);

If you accidentally pass beta by value instead, the application will crash with a segmentation fault, a bug that is hard to track down if you are not watching for it. In addition, scalars can also be returned through pointers as output parameters. cuSPARSE provides the cusparseSetPointerMode API to control whether scalar parameters are read as host or device pointers.
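A sketch of switching the pointer mode; host mode is the default:

// Scalar arguments such as alpha and beta point into host memory (the default)
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);

// Or: scalar arguments point into device memory instead
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_DEVICE);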

THE cuBLAS LIBRARY

cuBLAS is also a linear algebra library; unlike cuSPARSE, it ports the interface of the traditional BLAS linear algebra library (BLAS stands for Basic Linear Algebra Subprograms). cuBLAS level 1 covers vector-vector operations, level 2 covers matrix-vector operations, and level 3 covers matrix-matrix operations. Unlike cuSPARSE, cuBLAS supports no sparse data formats; it works only with dense matrices and vectors.

Since the BLAS library was originally written in FORTRAN, it uses column-major ordering and one-based indexing, whereas C/C++ conventionally uses row-major ordering. Column-major storage looks like this:

[Figure: column-major storage of a matrix]

We can compare the formulas that map a (row, column) pair to a one-dimensional index under the two layouts:

row-major: index = row * ncols + col
column-major: index = col * nrows + row

To stay compatible, cuBLAS also uses column-major storage, which can be confusing for people used to C/C++.

One-based indexing, on the other hand, means that, unlike in C and similar languages, the first element of an array is referenced with 1 rather than 0; that is, for an array of N elements, the index of the last element is N rather than N - 1.

However, cuBLAS is called from C/C++, and it cannot change the semantics of those languages, so it must use zero-based indexing. This leads to an odd hybrid: FORTRAN-style column-major ordering combined with C-style zero-based indexing, rather than the one-based indexing that accompanies column-major in FORTRAN.

cuBLAS offers two APIs. The cuBLAS legacy API was the initial implementation and has been deprecated; the current cuBLAS API is the one to use, and the differences between the two are small.

After reading the following content, you will find that using cuBLAS is much the same as using cuSPARSE, so code for these libraries can be written in a similar way.

Managing cuBLAS Data

Compared with cuSPARSE, cuBLAS data handling is much simpler: all operations work on dense vectors or matrices. cudaMalloc is still used to allocate device memory, but cublasSetVector/cublasGetVector and cublasSetMatrix/cublasGetMatrix are used to transfer data between device and host (in fact, not very different from cuSPARSE). Under the hood these APIs call cudaMemcpy, and they are well optimized for both strided and unstrided data. For example:

cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);

Most of the parameters are self-explanatory. lda and ldb are the leading dimensions of the source matrix A and the destination matrix B, that is, the total number of rows in each. They matter only when transferring part of a host matrix; when transferring a complete matrix, both lda and ldb should be M.

If we have a dense two-dimensional column-major matrix A of single-precision floating-point elements with size M x N, the whole matrix is transferred with:

cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);
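To see where the leading dimension matters, here is a sketch of transferring only the top-left P x Q block of A into a smaller P x Q device matrix dB (P and Q are hypothetical, with P <= M and Q <= N):

// lda = M: column stride of the full host matrix A
// ldb = P: column stride of the smaller device matrix dB
cublasSetMatrix(P, Q, sizeof(float), A, M, dB, P);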

Vectors are transferred with cublasSetVector, whose signature is:

cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy);

Here x is the source start address on the host and y the destination start address on the device; n is the number of elements to transfer, elemSize is the size of each element in bytes, and incx/incy is the address interval, or stride, between consecutive elements. Transferring a single column of length M from the column-major matrix A to the vector dV looks like this:

cublasSetVector(M, sizeof(float), A, 1, dV, 1);

You can also transfer a single row of matrix A to the vector dV; because the matrix is column-major, consecutive elements of a row are M apart:

cublasSetVector(N, sizeof(float), A, M, dV, 1);

These examples show that cuBLAS is much easier to use than cuSPARSE. Unless an application really depends on sparse matrices, cuBLAS is usually the choice: it delivers good performance while also improving development efficiency.

Demonstrating cuBLAS

The code below shows the typical use of cuBLAS and why it is easy to work with. Thanks to the GPU's computing power, performance is about 15 times that of BLAS on the CPU, while development with cuBLAS is only slightly more troublesome than with traditional BLAS.

// Create the cuBLAS handle
cublasCreate(&handle);

// Allocate device memory
cudaMalloc((void **)&dA, sizeof(float) * M * N);
cudaMalloc((void **)&dX, sizeof(float) * N);
cudaMalloc((void **)&dY, sizeof(float) * M);

// Transfer inputs to the device
cublasSetVector(N, sizeof(float), X, 1, dX, 1);
cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

// Execute the matrix-vector multiplication
cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);

// Retrieve the output vector from the device
cublasGetVector(M, sizeof(float), dY, 1, Y, 1);

The usage of cuBLAS is intuitive and easy to understand.

Compile command:

$ nvcc -lcublas cublas.cu

Porting from BLAS

Converting a traditional C application that uses the BLAS library to cuBLAS is also intuitive and can be summarized as the following steps: allocate device memory, transfer the inputs, replace the BLAS calls with their cuBLAS equivalents, and retrieve the results:

// Allocate device memory
cudaMalloc((void **)&dA, sizeof(float) * M * N);
cudaMalloc((void **)&dX, sizeof(float) * N);
cudaMalloc((void **)&dY, sizeof(float) * M);

// Transfer inputs to the device
cublasSetVector(N, sizeof(float), X, 1, dX, 1);
cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

// Execute the matrix-vector multiplication
cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);

// Retrieve the output vector from the device
cublasGetVector(M, sizeof(float), dY, 1, Y, 1);

The prototype of the equivalent BLAS routine is:

void cblas_sgemv(const CBLAS_ORDER order, const CBLAS_TRANSPOSE TransA,
                 const MKL_INT M, const MKL_INT N, const float alpha,
                 const float *A, const MKL_INT lda, const float *X,
                 const MKL_INT incX, const float beta, float *Y,
                 const MKL_INT incY);

There are still many similarities between the two. The differences are that BLAS takes an order parameter specifying whether the input data is row-major or column-major, and that BLAS passes alpha and beta by value rather than by reference.
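As a sketch for comparison, the corresponding CPU-side call would look like this; note the explicit order flag and alpha/beta passed by value:

cblas_sgemv(CblasColMajor, CblasNoTrans, M, N, alpha, A, M, X, 1, beta, Y, 1);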

The final step, once the port is functionally correct, is performance tuning, for example:

  • Reuse device resources instead of releasing and reallocating them.
  • Minimize redundant data transfers between device and host.
  • Use stream-based execution to overlap transfers asynchronously.

Important Topics in cuBLAS Development

Compared with cuSPARSE, cuBLAS is easier to use, especially if you are already familiar with BLAS. Note, however, that even though cuBLAS's behavior is easier to understand, intuition can still lead to misunderstandings; after all, cuBLAS is not BLAS.

Programmers coming from row-major languages must take extra care with cuBLAS. We may be comfortable flattening a multi-dimensional array in row-major order, but that habit does not transfer to column-major storage. The macro below helps convert a (row, column) pair into a column-major linear index:

#define R2C(r, c, nrows) ((c) * (nrows) + (r))

Even with this macro, we still need to watch the loop ordering. C/C++ programmers habitually write:

for (int r = 0; r < nrows; r++) {
    for (int c = 0; c < ncols; c++) {
        A[R2C(r, c, nrows)] = ...
    }
}

This code works, but it is not optimal, because it does not scan memory linearly. If nrows is very large, the cache hit rate will be close to zero. What we need instead is:

for (int c = 0; c < ncols; c++) {
    for (int r = 0; r < nrows; r++) {
        A[R2C(r, c, nrows)] = ...
    }
}

So be careful with access order when tuning for the cache: a careless loop nest can silently ruin the cache hit rate.

cuFFT

Incomplete; to be continued~~~


Reference: Professional CUDA C Programming
