Forgive me for the mixed Chinese and English.
I need to run multiple programs at the same time, and each program launches GPU kernels many times. Can these kernels be executed in parallel? The answer is no, unless the CUDA Multi-Process Service (MPS) is used.
If the context is the primary context created by the runtime API, the threads of a program share it, and by using streams multiple kernels can run in parallel.
If it is a standard context created through the driver API, the threads of a program cannot share it, but a context can be handed from one thread to another through context migration.
If there are multiple processes, they cannot share a context, which means these processes have to use the GPU serially.
The explanations are as follows:
First: What is the GPU context?
The CUDA device context is discussed in the Programming Guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process's use of a GPU). Separate processes will normally have separate contexts (as will separate devices), because these processes have independent GPU usage and independent memory maps.
If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it is possible to create multiple contexts from a single process, but it is not usually necessary.
If you have multiple contexts, kernels launched in those contexts will require context switching to go from a kernel in one context to a kernel in another context. Those kernels cannot run concurrently.
CUDA Runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.
Context swapping isn't a cheap operation. At least on Linux, multiple contexts compete for GPU resources on a first come, first served basis. This includes memory (there is no concept of swapping or paging). WDDM versions of Windows might work differently because there is an OS-level GPU memory manager in play, but I don't have any experience with it.
If you have a single GPU, I think you would do better running a persistent thread that holds the GPU context for the life of the application, and then feeding that thread work from producer threads. That gives you the ability to impose your own scheduling logic on the GPU and explicitly control how work is processed. That is probably the GPUWorker model, but I am not very familiar with that code's inner workings.
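The persistent-thread model described above can be sketched in plain C++. This is only a minimal sketch under stated assumptions: the class name GpuWorker and its submit() method are hypothetical (they are not taken from the GPUWorker code mentioned above), and each job is an ordinary callable that stands in for the kernel launches and memory copies real code would issue. No CUDA calls are made here; a real worker would select the device and create its context once at the top of run().

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one long-lived thread that would own the GPU context
// and runs every submitted job in FIFO order.
class GpuWorker {
public:
    GpuWorker() : worker_([this] { run(); }) {}

    ~GpuWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // pending jobs are drained before the thread exits
    }

    GpuWorker(const GpuWorker&) = delete;
    GpuWorker& operator=(const GpuWorker&) = delete;

    // Called from any producer thread; the job runs later on the worker thread.
    std::future<int> submit(std::function<int()> job) {
        std::packaged_task<int()> task(std::move(job));
        std::future<int> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        // Assumption: real code would bind the device and context here, once.
        for (;;) {
            std::packaged_task<int()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to run
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task();  // stand-in for kernel launches and copies
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<int()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before run() starts
};
```

A producer thread would call something like worker.submit([] { return 0; }) and wait on the returned future. Because a single thread executes all jobs, the order in which work reaches the GPU is whatever order the queue imposes, which is where custom scheduling logic could be added.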
Streams are a mechanism for issuing asynchronous commands to a single GPU context so that overlap can occur between CUDA function calls (for example, copying during kernel execution). They don't break the basic 1:1 thread to device context paradigm that CUDA is based around. Kernel execution can't overlap on current hardware (the new Fermi hardware is supposed to eliminate this restriction).
______________________ explained better ______________.
CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device.
CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified. (Running multiple contexts on a single GPU also cannot normally violate basic GPU limits, such as memory availability for device allocations.)
The "exception" to this case (serialization of GPU activity from independent host processes) is the CUDA Multi-Process Service (MPS). In a nutshell, MPS acts as a "funnel" that collects CUDA activity emanating from several host processes and runs that activity as if it emanated from a single host process. The principal benefit is to avoid the serialization of kernels which might otherwise be able to run concurrently. The canonical use case is launching multiple MPI ranks that all intend to use a single GPU resource.
Note that the above description applies to GPUs which are in the "Default" compute mode. GPUs in "Exclusive Process" or "Exclusive Thread" compute mode will reject any attempt to create more than one process/context on a single device. In either of these modes, attempts by other processes to use a device already in use will result in a failure reported by the CUDA API. The compute mode can be modified in some cases using the nvidia-smi utility.
________________________________________________
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware. So to answer your first question: if seven separate threads or processes all try to establish a context and run on the same GPU simultaneously, they will be serialised, and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing, and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system. You would be better off launching a single worker thread that holds a GPU context and using messaging from the other threads to push work onto the GPU. Alternatively, there is a context migration facility available in the CUDA driver API, but that only works with threads from the same process, and the migration mechanism has latency and host CPU overhead.
On multiple programs launching kernels simultaneously on the same GPU