CUDA Linear memory allocation

Overview: Linear memory can be allocated via cudaMalloc(), cudaMallocPitch(), and cudaMalloc3D().

1. 1D Linear memory allocation

cudaMalloc(void** devPtr, size_t size)       // allocate memory on the device
cudaMemcpy(void* dst, const void* src, size_t size, enum cudaMemcpyKind kind)    // copy data
cudaMemcpyToSymbol()      // copy data into a __constant__ or __device__ variable
cudaMemcpyFromSymbol()    // the reverse of the above
cudaFree()                // free device memory
cudaMemset()              // initialize device memory

Note: data exchange between the host and the device synchronizes automatically, but device-to-device copies do not; they require cudaThreadSynchronize() (deprecated in modern CUDA in favor of cudaDeviceSynchronize()).

2. 2D Linear memory allocation

2.1 Allocation

cudaMallocPitch(void** devPtr, size_t* pitch, size_t widthInBytes, size_t height)    // allocate a 2D array in linear memory; the width is given in bytes, while the height is given in rows of elements

A 2-dimensional array allocated in C is generally stored contiguously: A[y][x] lives at byte offset y*widthOfX*sizeof(element) + x*sizeof(element) from the base address.

In CUDA's global memory, however, contiguous accesses that start from a 256-byte-aligned address (addr = 0, 256, 512, ...) are the most efficient. To improve memory access efficiency, CUDA provides the cudaMallocPitch function, which allocates memory so that the starting address of the first element of each row of the array is guaranteed to be aligned. Because the number of elements per row is arbitrary, widthOfX*sizeof(element) is not necessarily a multiple of 256. To keep the first element of every row aligned, cudaMallocPitch pads each row with some extra bytes so that widthOfX*sizeof(element) + padding is a multiple of 256 (i.e., aligned).

Consequently, computing the address of a[y][x] as y*widthOfX*sizeof(element) + x*sizeof(element), as above, is no longer correct. It should instead be y*[widthOfX*sizeof(element) + padding] + x*sizeof(element). The pitch value returned by the function is exactly widthOfX*sizeof(element) + padding. Note that the input parameter widthInBytes should be widthOfX*sizeof(element), and subsequent copies must take the pitch into account accordingly.

2.2 Access

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;           // how an element is accessed

cudaMallocPitch() returns the row spacing via *pitch: the actual width, in bytes, of each allocated row. The pitch is then used as a separate parameter when computing addresses within the 2D array.

2.3 Copy

cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind)

Pay special attention to the difference between width and pitch: width is the width in bytes of the data that actually needs to be copied, while pitch is the row stride of the 2D linear allocation. On the host side, pitch == width when data is transferred between the device and the host.

As we can see, CUDA does not provide multi-subscript access to 2D linear memory; elements are still accessed by computed offsets. The difference is that the pitch-aligned layout is very conducive to coalesced access.

Example: the following code allocates a two-dimensional floating-point array of size width*height and demonstrates how to iterate over its elements in device code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

3. 3D Linear memory allocation

cudaError_t cudaMalloc3D(
    struct cudaPitchedPtr* pitchedDevPtr,
    struct cudaExtent      extent
)

Example: the following code allocates a three-dimensional floating-point array of size width*height*depth and demonstrates how to iterate over its elements in device code:

// Host code
cudaPitchedPtr devPitchedPtr;
cudaExtent extent = make_cudaExtent(64, 64, 64);
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, extent);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, cudaExtent extent)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * extent.height;
    for (int z = 0; z < extent.depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < extent.height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < extent.width; ++x) {
                float element = row[x];
            }
        }
    }
}

Category: CUDA Learning Notes

