Translated freely, with some parts omitted, just to get familiar with the basic interface (in fact, a few sentences were left untranslated).
gpu::DevMem2D_
class gpu::DevMem2D_
A lightweight class that represents pitched (aligned) data allocated in GPU global memory. Intended for users who write their own CUDA code.
template <typename T> struct DevMem2D_
{
    int cols;
    int rows;
    T* data;
    size_t step;

    DevMem2D_() : cols(0), rows(0), data(0), step(0) {}
    DevMem2D_(int rows, int cols, T* data, size_t step);

    template <typename U>
    explicit DevMem2D_(const DevMem2D_<U>& d);

    typedef T elem_type;
    enum { elem_size = sizeof(elem_type) };

    __CV_GPU_HOST_DEVICE__ size_t elemSize() const;

    /* returns pointer to the beginning of the given image row */
    __CV_GPU_HOST_DEVICE__ T* ptr(int y = 0);
    __CV_GPU_HOST_DEVICE__ const T* ptr(int y = 0) const;
};

typedef DevMem2D_<unsigned char> DevMem2D;
typedef DevMem2D_<float> DevMem2Df;
typedef DevMem2D_<int> DevMem2Di;
gpu::PtrStep_
class gpu::PtrStep_
Similar to gpu::DevMem2D_, but for performance reasons this class keeps only the pointer and row step, not the width and height.
template <typename T> struct PtrStep_
{
    T* data;
    size_t step;

    PtrStep_();
    PtrStep_(const DevMem2D_<T>& mem);

    typedef T elem_type;
    enum { elem_size = sizeof(elem_type) };

    __CV_GPU_HOST_DEVICE__ size_t elemSize() const;
    __CV_GPU_HOST_DEVICE__ T* ptr(int y = 0);
    __CV_GPU_HOST_DEVICE__ const T* ptr(int y = 0) const;
};

typedef PtrStep_<unsigned char> PtrStep;
typedef PtrStep_<float> PtrStepf;
typedef PtrStep_<int> PtrStepi;
gpu::PtrElemStep_
class gpu::PtrElemStep_, inherited from gpu::PtrStep_. Similar to gpu::DevMem2D_, but contains only a pointer and a row step, excluding width and height; here the step is measured in elements rather than bytes, so the class can only be constructed when sizeof(T) evenly divides 256. (To be honest, I do not quite understand the purpose of this class; my personal guess is that it helps CUDA compilation optimization.)
template <typename T> struct PtrElemStep_ : public PtrStep_<T>
{
    PtrElemStep_(const DevMem2D_<T>& mem);

    __CV_GPU_HOST_DEVICE__ T* ptr(int y = 0);
    __CV_GPU_HOST_DEVICE__ const T* ptr(int y = 0) const;
};
gpu::GpuMat
class gpu::GpuMat
- The base storage class for GPU memory; uses reference counting.
- Restrictions:
- Only 2D matrices are supported.
- No functions return references to the data (such references would be useless, since the CPU cannot dereference GPU memory directly).
- The expression templates technique is not supported, which may lead to extra memory allocations.
- GpuMat can be converted to DevMem2D_ and PtrStep_, which can be used directly in kernel functions.
Comparison with Mat: GpuMat::isContinuous() == false, to guarantee row alignment (a requirement for CUDA coalesced access). PS: a single-row GpuMat is always continuous.
class CV_EXPORTS GpuMat
{
public:
    //! default constructor
    GpuMat();
    GpuMat(int rows, int cols, int type);
    GpuMat(Size size, int type);
    .....

    //! builds GpuMat from Mat. Blocks until uploading to device finishes.
    explicit GpuMat(const Mat& m);

    //! returns lightweight DevMem2D_ structure for passing to kernels;
    //! data is not copied
    template <class T> operator DevMem2D_<T>() const;
    template <class T> operator PtrStep_<T>() const;

    //! transfers data from host to GPU device (blocking)
    void upload(const cv::Mat& m);
    void upload(const CudaMem& m, Stream& stream);

    //! transfers data from GPU to host. Blocking calls.
    void download(cv::Mat& m) const;

    //! download async
    void download(CudaMem& m, Stream& stream) const;
};
Note: global or static GpuMat variables are not recommended (the reason is not explained in the documentation).

gpu::createContinuous
Creates a continuous matrix in GPU memory.
C++: void gpu::createContinuous(int rows, int cols, int type, GpuMat& m)
C++: GpuMat gpu::createContinuous(int rows, int cols, int type)
C++: void gpu::createContinuous(Size size, int type, GpuMat& m)
C++: GpuMat gpu::createContinuous(Size size, int type)
Parameters:
rows - Row count.
cols - Column count.
type - Type of the matrix.
m - Destination matrix. It changes only if it does not already have the proper type and area (rows × cols).
gpu::ensureSizeIsEnough
Ensures that the matrix's buffer is large enough. If it is not, the buffer is reallocated; otherwise nothing is done.
C++: void gpu::ensureSizeIsEnough(int rows, int cols, int type, GpuMat& m)
C++: void gpu::ensureSizeIsEnough(Size size, int type, GpuMat& m)
Parameters:
rows - Minimum desired number of rows.
cols - Minimum desired number of columns.
size - Rows and columns passed as a structure.
type - Desired matrix type.
m - Destination matrix.
gpu::registerPageLocked
Page-locks the memory of a host Mat. This can speed up data transfers to and from the GPU, but allocating too much page-locked memory degrades overall system performance; see the CUDA documentation for details.
C++: void gpu::registerPageLocked(Mat& m)
Parameters:
m - Input matrix.
gpu::unregisterPageLocked
Unregisters the memory, making it pageable again.
C++: void gpu::unregisterPageLocked(Mat& m)
Parameters:
m - Input matrix.
gpu::CudaMem
class gpu::CudaMem
This class uses reference counting and the CUDA memory-allocation functions to allocate host memory. Its interface is similar to Mat's, but with an additional memory-type parameter:
- ALLOC_PAGE_LOCKED sets a page-locked memory type, used commonly for fast and asynchronous uploading/downloading of data from/to the GPU.
- ALLOC_ZEROCOPY specifies a zero-copy memory allocation that enables mapping the host memory to GPU address space, if supported.
- ALLOC_WRITE_COMBINED sets the write-combined buffer, which is not cached by the CPU. Such buffers are used to supply the GPU with data when the GPU only reads it. The advantage is better CPU cache utilization.
For more information about these flags, see the CUDA documentation.
class CV_EXPORTS CudaMem
{
public:
    enum { ALLOC_PAGE_LOCKED = 1, ALLOC_ZEROCOPY = 2, ALLOC_WRITE_COMBINED = 4 };

    CudaMem(Size size, int type, int alloc_type = ALLOC_PAGE_LOCKED);

    //! creates from cv::Mat with copying data
    explicit CudaMem(const Mat& m, int alloc_type = ALLOC_PAGE_LOCKED);

    ......

    void create(Size size, int type, int alloc_type = ALLOC_PAGE_LOCKED);

    //! returns matrix header with disabled ref. counting for CudaMem data.
    Mat createMatHeader() const;
    operator Mat() const;

    //! maps host memory into device address space
    GpuMat createGpuMatHeader() const;
    operator GpuMat() const;

    //! if host memory can be mapped to gpu address space
    static bool canMapHostMemory();

    int alloc_type;
};
gpu::CudaMem::createMatHeader
Creates a Mat header, without reference counting, for the gpu::CudaMem data.
C++: Mat gpu::CudaMem::createMatHeader() const
gpu::CudaMem::createGpuMatHeader
Maps the CPU (host) memory to the GPU address space and creates a gpu::GpuMat header, without reference counting, pointing to it.
C++: GpuMat gpu::CudaMem::createGpuMatHeader() const
Note that this method is only useful when ALLOC_ZEROCOPY was used for the allocation, and it also requires support from specific hardware. Laptops often share video and CPU memory, so the address spaces can be mapped, which eliminates an extra copy. (I have never used this feature in past projects, and I am leaving the next sentence untranslated to avoid mistakes.)
gpu::CudaMem::canMapHostMemory
Determines whether ALLOC_ZEROCOPY is supported.
C++: static bool gpu::CudaMem::canMapHostMemory()
gpu::Stream
class gpu::Stream
Encapsulates a CUDA stream.
Note: currently you may face problems if an operation is enqueued twice with different data. Some functions use constant GPU memory, and the next call may update that memory before the previous call has finished. Calling different operations asynchronously is safe because each operation has its own constant buffer. Memory copy/upload/download/set operations on the buffers you hold are also safe.
class CV_EXPORTS Stream
{
public:
    Stream();
    ~Stream();

    Stream(const Stream&);
    Stream& operator=(const Stream&);

    bool queryIfComplete();
    void waitForCompletion();

    //! downloads asynchronously.
    //! Warning! cv::Mat must point to page-locked memory
    //! (i.e. to CudaMem data or to its subMat)
    void enqueueDownload(const GpuMat& src, CudaMem& dst);
    void enqueueDownload(const GpuMat& src, Mat& dst);

    //! uploads asynchronously.
    //! Warning! cv::Mat must point to page-locked memory
    //! (i.e. to CudaMem data or to its ROI)
    void enqueueUpload(const CudaMem& src, GpuMat& dst);
    void enqueueUpload(const Mat& src, GpuMat& dst);

    void enqueueCopy(const GpuMat& src, GpuMat& dst);

    void enqueueMemSet(const GpuMat& src, Scalar val);
    void enqueueMemSet(const GpuMat& src, Scalar val, const GpuMat& mask);

    //! converts matrix type, ex from float to uchar depending on type
    void enqueueConvert(const GpuMat& src, GpuMat& dst, int type,
                        double a = 1, double b = 0);
};
gpu::Stream::queryIfComplete
Determines whether the current stream queue has finished. Returns true if it has, false otherwise.
C++: bool gpu::Stream::queryIfComplete()
gpu::Stream::waitForCompletion
Blocks the current CPU thread until all operations in the stream are complete.
C++: void gpu::Stream::waitForCompletion()
gpu::StreamAccessor
class gpu::StreamAccessor
Used to obtain cudaStream_t from gpu::Stream.
This class is declared in stream_accessor.hpp because that is the only public header that depends on the CUDA Runtime API; including it brings that dependency into your code.
struct StreamAccessor
{
    CV_EXPORTS static cudaStream_t getStream(const Stream& stream);
};