Original address
https://github.com/apache/mesos/blob/master/docs/gpu-support.md
Mesos has fully supported Nvidia GPUs since version 1.0.0.
Overview
Running GPUs under Mesos is straightforward once you understand a few key steps. First, you need to set the necessary agent flags so that the agent enumerates its GPUs and advertises them to the Mesos master. Second, the framework needs to register with the appropriate framework capability so that the master will offer GPU resources to it. Once both pieces are in place, a container launched with GPU resources consumes them just like any other resource, such as CPUs, memory, and disk.
As noted above, Mesos treats GPUs like any other hardware resource (CPUs, memory, disk), so they appear in a resource offer as follows:
cpus:8; mem:1024; disk:65536; gpus:4;
However, unlike CPUs, memory, and disk, GPUs can only be requested in whole-number amounts; if a task requests a fractional amount of gpus, it fails with a TASK_ERROR at launch time.
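As an illustration only (this reuses the mesos-execute flags that appear in the examples later in this document, and assumes the master and agent from those examples are already running), a whole-number request succeeds while a fractional one is rejected:

# Accepted: a whole number of GPUs.
$ mesos-execute --master=127.0.0.1:5050 --name=gpu-ok \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"

# Rejected with TASK_ERROR: fractional GPU amounts are not allowed.
$ mesos-execute --master=127.0.0.1:5050 --name=gpu-bad \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:0.5"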
At the time of writing, Nvidia GPU support is only available for tasks launched through the Mesos containerizer (it is not supported by the Docker containerizer). That said, the Mesos containerizer can run Docker images natively, so this limitation should not affect most users.
Moreover, we mimic the automatic mounting of Nvidia drivers and tools that nvidia-docker provides, so existing GPU-based Docker images can be tested or deployed under Mesos without modification.
The following sections explain the agent flags and framework capabilities required for Nvidia GPU support in Mesos. We then walk through examples in both non-Docker and Docker environments. Finally, we point to detailed instructions for installing the necessary Nvidia drivers and tools on your machines.
Agent Flags
We need to set the following isolation flags to enable GPU support on an agent.
--isolation="cgroups/devices,gpu/nvidia"
The cgroups/devices flag tells the agent to restrict access to devices (/dev entries) at task launch time. Combined with the gpu/nvidia flag, it allows the agent to grant and revoke access to specific GPUs on a per-task basis.
By default, all GPUs on the agent are automatically discovered and advertised to the Mesos master as resources. Sometimes, however, you may want to expose only some of the GPUs. In that case, the following flags let you specify exactly which devices to offer:
--nvidia_gpu_devices="<list_of_gpu_ids>"
--resources="gpus:<num_gpus>"
For the --nvidia_gpu_devices flag, you need to provide a comma-separated list of GPU IDs; you can run the nvidia-smi command on the agent host to see the available GPUs and decide which ones the agent should use.
The output of nvidia-smi lists each installed GPU together with its ID; those IDs are exactly the values the flag accepts.
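If you only want the device indices rather than the full summary table, nvidia-smi's query mode can print them directly (a minimal sketch; the query flags are standard in recent nvidia-smi releases):

# Print each GPU's index and name, one per line.
$ nvidia-smi --query-gpu=index,name --format=csv,noheader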
Any non-empty subset of the reported GPU IDs is a valid setting, for example:
--nvidia_gpu_devices="0"
--nvidia_gpu_devices="0,1"
--nvidia_gpu_devices="1"
For the --resources="gpus:<num_gpus>" flag, the GPU count must match the number of IDs listed in --nvidia_gpu_devices; the agent will fail to start if they are inconsistent, so pay special attention here.
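Putting the two flags together with the isolation flag from above, a sketch of an agent that exposes only two of its GPUs might look like this (the master address and work_dir are the same placeholders used in the examples below):

# Expose only GPUs 0 and 1 to Mesos; gpus:2 must match the two listed IDs.
$ mesos-agent --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --isolation="cgroups/devices,gpu/nvidia" \
      --nvidia_gpu_devices="0,1" \
      --resources="gpus:2"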
Framework Capabilities
Once the agent is started with the flags above, its GPU resources are advertised to the Mesos master just like traditional resources. However, the master only offers GPU resources to frameworks that have registered with the GPU_RESOURCES framework capability.
This requirement keeps machines with GPUs from being filled up by non-GPU workloads as much as possible (not much of an issue when every node in the cluster has GPUs, but a big hassle in a mixed cluster).
The following C++ code shows how a framework sets this capability:
FrameworkInfo framework;
framework.add_capabilities()->set_type(
    FrameworkInfo::Capability::GPU_RESOURCES);

GpuScheduler scheduler;

driver = new MesosSchedulerDriver(
    &scheduler,
    framework,
    127.0.0.1:5050);

driver->run();
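For comparison, the mesos-execute examples later in this document opt in to the same capability from the command line by passing --framework_capabilities="GPU_RESOURCES".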
Minimal GPU Capable Cluster
Below we show how to run a task on a minimal GPU-capable cluster, first without Docker support and then with Docker containers. Both examples do the same thing in different environments.
Note: both examples assume you have already installed all of the external dependencies Mesos needs for Nvidia GPU support. See the External Dependencies section at the bottom of this document.
Minimal Setup without support for Docker Containers
The commands below show the most basic way to run a GPU task on a single-node Mesos cluster (localhost). The agent flags are set as described above, the GPU_RESOURCES framework capability is enabled, and the task requests one GPU:
$ mesos-master \
      --ip=127.0.0.1 \
      --work_dir=/var/lib/mesos

$ mesos-agent \
      --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --isolation="cgroups/devices,gpu/nvidia"

$ mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"
If everything is fine, you should see the output of nvidia-smi in the task's stdout.
Minimal Setup with Support for Docker Containers
The commands below show the same basic setup, except that the task is launched from a Docker image. In addition to the agent flags and the GPU_RESOURCES framework capability described above, the agent also needs the flags required to run Docker containers:
$ mesos-master \
      --ip=127.0.0.1 \
      --work_dir=/var/lib/mesos

$ mesos-agent \
      --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --image_providers=docker \
      --executor_environment_variables="{}" \
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"

$ mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"
If everything is OK, you should again see the nvidia-smi output in the task's stdout.
External Dependencies
Any machine running Mesos with Nvidia GPU support must have a valid Nvidia kernel driver installed. It is also highly recommended to install the corresponding Nvidia CUDA toolkit; many GPU jobs rely on CUDA, and not having it installed severely limits the kinds of jobs you can run.
Installing the Required Tools
Nvidia drivers can be downloaded from the link below. Before downloading, make sure you select the driver that matches your GPU, your operating system, and the CUDA toolkit version you plan to install.
http://www.nvidia.com/Download/index.aspx
However, many Linux distributions ship with the open source nouveau video driver pre-installed, which conflicts with the Nvidia driver you are about to install. The following links can help you remove nouveau first (a generic sketch follows them):
http://www.dedoimedo.com/computers/centos-7-nvidia.html
http://www.allaboutlinux.eu/remove-nouveau-and-install-nvidia-driver-in-ubuntu-15-04/
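The exact procedure differs per distribution and is covered by the guides above; as a generic sketch only (assuming Ubuntu-style initramfs tooling), you blacklist the nouveau module and rebuild the initramfs before installing the Nvidia driver:

# Blacklist the in-tree nouveau driver so it does not load at boot.
$ sudo bash -c "cat > /etc/modprobe.d/blacklist-nouveau.conf << EOF
blacklist nouveau
options nouveau modeset=0
EOF"

# Rebuild the initramfs and reboot; the command varies by distribution
# (e.g. 'sudo dracut --force' on CentOS instead of update-initramfs).
$ sudo update-initramfs -u
$ sudo reboot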
After installing the Nvidia driver, you can install the CUDA toolkit by following this guide:
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/
In addition to the steps in the guide above, we strongly recommend adding CUDA's lib64 directory to your ldcache so that Mesos tasks can link against it correctly. The commands are as follows:
sudo"cat > /etc/ld.so.conf.d/cuda-lib64.conf << EOF/usr/local/cuda/lib64EOF"sudo ldconfig
If you choose not to add CUDA's libraries to the ldcache, you must instead set LD_LIBRARY_PATH appropriately for every task that uses them. This is not the recommended method and may produce warnings.
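If you do take the LD_LIBRARY_PATH route, a hedged sketch of a task command is shown below; my-cuda-binary is a hypothetical placeholder for your own CUDA application:

# Not recommended: point the dynamic linker at CUDA's libraries per task.
$ mesos-execute --master=127.0.0.1:5050 --name=cuda-test \
      --command="LD_LIBRARY_PATH=/usr/local/cuda/lib64 ./my-cuda-binary" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"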
Verifying the installation
Once the Nvidia driver is installed, you can run the nvidia-smi tool and check that it lists your GPUs:
nvidia-smi
In addition, you can further verify the installation by building and running the CUDA samples, as described at the following link (recommended):
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/#install-samples
Finally, it is a good idea to run the Nvidia GPU related Mesos unit tests on your GPU-equipped machine and make sure they all pass.
Running Mesos Unit Tests
At the time of writing, the relevant unit tests are as follows:
DockerTest.ROOT_DOCKER_NVIDIA_GPU_DeviceAllow
DockerTest.ROOT_DOCKER_NVIDIA_GPU_InspectDevices
NvidiaGpuTest.ROOT_CGROUPS_NVIDIA_GPU_VerifyDeviceAccess
NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage
NvidiaGpuTest.ROOT_CGROUPS_NVIDIA_GPU_FractionalResources
NvidiaGpuTest.NVIDIA_GPU_Discovery
NvidiaGpuTest.ROOT_CGROUPS_NVIDIA_GPU_FlagValidation
NvidiaGpuTest.NVIDIA_GPU_Allocator
NvidiaGpuTest.ROOT_NVIDIA_GPU_VolumeCreation
NvidiaGpuTest.ROOT_NVIDIA_GPU_VolumeShouldInject
The capitalized words after the '.' in each test name are filters that apply when the tests are run. The filters used here are ROOT, CGROUPS, and NVIDIA_GPU: the tests must be run as the root user on a machine with cgroups support, and some of them additionally require Nvidia GPUs to be installed and accessible.
If these conditions are met, you can run the unit tests with the following commands:
[mesos]$ GTEST_FILTER="" make -j check
[mesos]$ sudo bin/mesos-tests.sh --gtest_filter="*NVIDIA_GPU*"
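The --gtest_filter flag accepts any test-name pattern, so you can also run a single test from the list above, for example:

[mesos]$ sudo bin/mesos-tests.sh --gtest_filter="NvidiaGpuTest.NVIDIA_GPU_Discovery"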