As more and more machine learning tasks require GPUs, understanding the cost and performance trade-offs across GPU vendors has become critical.
Start-up RaRe Technologies recently released a large-scale machine learning benchmark that focuses on GPUs, comparing several popular hardware providers on cost, ease of use, stability, scalability, and performance.
The figure above shows the cost of training a bidirectional LSTM on the Twitter sentiment classification task (roughly 1.5 million tweets, 4 epochs) across the six GPU hardware platforms. As the figure shows, dedicated servers are the best choice for controlling costs.
This benchmark compares the following hardware platforms: Amazon AWS EC2, Google Compute Engine (GCE), IBM Softlayer, Hetzner, Paperspace, and LeaderGPU, all of which provided credits and support during the test. Microsoft Azure had not responded officially at the time the benchmark was released, so regrettably it is not included in the comparison.
Even so, the test covers a comprehensive range of GPU platform types: virtual machines (AWS, GCE), bare-metal infrastructure (Softlayer), dedicated servers (Hetzner), and dedicated GPUaaS (LeaderGPU, Paperspace). The researchers also said they hoped the test would show whether high-end GPUs are worth the extra money.
First, the results. A few notes on the chart:
* These are results on multi-GPU instances, where the model was trained on all GPUs using Keras's multi_gpu_model function; the multiple GPUs were later found to be underutilized.
** For that reason, models on these multi-GPU instances were trained using only one of the GPUs.
+ Hetzner charges a monthly fee and provides a dedicated server; the hourly figure is prorated.
Benchmark settings: Twitter text sentiment classification task
Next, we discuss each platform in detail, along with how the test was set up.
Task. This benchmark uses a sentiment classification task [1]: a bidirectional LSTM is trained to do binary classification of Twitter tweets. The choice of algorithm is not very important; author Shiva Manne said his only real requirement for the benchmark was that the task be GPU-intensive. To ensure maximum GPU utilization, he used Keras's fast, CuDNN-backed LSTM implementation (the CuDNNLSTM layer).
Dataset. The Twitter Sentiment Analysis Dataset [2] contains 1,578,627 labeled tweets, each row tagged "1" for positive sentiment or "0" for negative sentiment. The model was trained for 4 epochs on 90% of the (shuffled) data, with the remaining 10% held out for model evaluation.
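For concreteness, here is a minimal sketch of the kind of model and training protocol described above, written against the Keras 2.x API the benchmark used; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the benchmark's exact settings.

```python
# Minimal sketch of the benchmark's setup: a bidirectional CuDNN-backed LSTM
# for binary tweet sentiment (Keras 2.x). Hyperparameters here are assumptions.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, CuDNNLSTM, Dense

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 40         # assumed (padded) tokens per tweet

model = Sequential([
    Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
    Bidirectional(CuDNNLSTM(64)),    # CuDNNLSTM runs on GPU only, maximizing utilization
    Dense(1, activation='sigmoid'),  # "1" = positive, "0" = negative
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Mirroring the protocol in the text: 4 epochs, 90% (shuffled) train / 10% eval.
# X, y = load_tweets()  # hypothetical loader for the 1,578,627 labeled tweets
# model.fit(X, y, epochs=4, batch_size=128, validation_split=0.1, shuffle=True)
```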
For reproducibility, Manne created an Nvidia Docker image that contains all the dependencies and data needed to re-run the benchmark. The Dockerfile and all necessary code can be found in this GitHub repository [3].
Ordering and use: LeaderGPU, AWS, and Paperspace are especially suitable for beginners
The ordering process on LeaderGPU and Paperspace is very smooth, with no complicated setup. Compared with AWS or GCE, though, Paperspace and LeaderGPU instances take a little longer (several minutes) to become available.
LeaderGPU, Amazon, and Paperspace offer free deep learning machine images pre-installed with Nvidia drivers, a Python development environment, and Nvidia-Docker, which lets you start experimenting almost immediately. This makes things much easier, especially for beginners who just want to play with machine learning models. However, to assess how easy it is to set up a custom instance for individual needs, Manne set everything up from scratch (except on LeaderGPU). In the process, he ran into some problems common across platforms: for example, the Nvidia driver being incompatible with the installed gcc version, or, after installing the driver, GPU utilization showing 100% with no evidence of any program actually running.
Surprisingly, running the Docker image on Paperspace's low-end instance (P6000) caused an error: the TensorFlow inside the image had been built from source with CPU optimizations (MSSE, MAVX, MFMA) that the Paperspace instance's CPU does not support. Running TensorFlow without these optimizations resolves the problem.
As for stability, every platform performed very well; no problems were encountered.
Cost: Dedicated servers are the best choice for cost control; cheaper GPUs are more cost-effective
Unsurprisingly, dedicated servers are the best choice for controlling costs. Hetzner charges by the month, which works out to a very low effective hourly price (the figure in the chart is prorated; for example, a hypothetical €100/month server kept busy around the clock costs about €0.14/hour). So as long as you have enough work to keep the server from sitting idle, a dedicated server is the right choice.
Among the virtual machine vendors, Paperspace is the clear winner. In the low-end GPU segment, training the model on Paperspace costs roughly half as much as on AWS ($1.60 vs. $3.30), and Paperspace shows similarly good cost-effectiveness in the high-end GPU segment.
Benchmark results: the cost of training a bidirectional LSTM for the Twitter sentiment classification task (roughly 1.5 million tweets, 4 epochs) on the various GPU hardware platforms.
Between AWS and GCE, the low-end GPU is slightly more expensive on AWS ($3.30 vs. $2.40), but the order reverses in the high-end segment ($3.30 vs. $3.40). This means that if you choose a high-end GPU, AWS may be the better option, and the premium you pay may be rewarded.
Note that IBM Softlayer and LeaderGPU look expensive mainly because their multi-GPU instances were underutilized. The benchmark was run with the Keras framework, whose multi-GPU implementation was surprisingly inefficient, sometimes even slower than a single GPU on the same machine. Neither platform offers single-GPU instances. The benchmark on Softlayer used all available GPUs via Keras's multi_gpu_model function, while the run on LeaderGPU used only one of the available GPUs. This leads to underutilized resources and considerable extra cost.
In addition, LeaderGPU offers the more powerful GTX 1080 Ti and Tesla V100 at the same per-minute price as the GTX 1080 and Tesla P100. Running on those servers would definitely lower the overall cost, so the low-end LeaderGPU costs shown in the chart are actually quite reasonable. This is worth keeping in mind if you plan to use a non-Keras framework that makes better use of multiple GPUs.
There is also a general trend: cheaper GPUs are more cost-effective than more expensive ones, suggesting that the reduction in training time does not offset the increase in total cost. For example, at AWS's listed prices a V100 ($3.06/hour) costs 3.4 times as much per hour as a K80 ($0.90/hour), so unless it trains more than 3.4 times faster, the total bill goes up.
Multi-GPU model training with Keras: speedup is hard to predict
Since multi-GPU training with Keras has come up, a few more words about it.
Many people in academia and industry like to implement deep learning models with high-level APIs such as Keras, and Keras itself is very popular, widely adopted, and frequently updated. Users may assume that switching to a multi-GPU model in Keras requires no extra work and yields a free speedup; the sketch below shows the one-line conversion in question.
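To see why the expectation is so tempting, here is a minimal sketch of that conversion using the Keras 2.x multi_gpu_model utility; the toy model and its hyperparameters are assumptions standing in for the benchmark's LSTM.

```python
# Minimal sketch of the "one-line" multi-GPU conversion in Keras 2.x.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, CuDNNLSTM, Dense
from keras.utils import multi_gpu_model

model = Sequential([
    Embedding(20000, 128, input_length=40),  # assumed vocab size / sequence length
    Bidirectional(CuDNNLSTM(64)),
    Dense(1, activation='sigmoid'),
])

# The seemingly free speedup: replicate the model across 2 GPUs and let Keras
# split each batch between them (data parallelism).
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='binary_crossentropy', optimizer='adam')
# As the benchmark found, however, this wrapper can underutilize the GPUs and
# sometimes trains no faster (or even slower) than a single GPU on the same machine.
```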
But as the figure below shows, that is not the case.
The speedup is quite unpredictable: multi-GPU training on the "dual GTX 1080" server was clearly faster than single-GPU training, while on the "dual P100" server multi-GPU training actually took longer. This behavior has been raised in several blogs and GitHub issues, and it is a noteworthy problem Manne ran into while investigating costs.
Model accuracy, hardware pricing, spot instances, and takeaways
Model accuracy
As a sanity check, the final accuracy of the model was compared at the end of training. Table 1 shows that the underlying hardware/platform has no effect on training quality, confirming that the benchmark was set up correctly.
Hardware pricing
GPU prices change frequently, but at the moment AWS offers K80 GPUs (p2 instances) starting at $0.90/hour, billed in one-second increments, while the more powerful, higher-performance Tesla V100 GPUs (p3 instances) start at $3.06/hour. Additional services such as data transfer, elastic IP addresses, and EBS-optimized instances incur extra charges. GCE is an economical alternative, offering the K80 and P100 at $0.45/hour and $1.46/hour respectively. These are also billed in one-second increments, and sustained usage earns discounts. Unlike AWS, however, GCE GPUs must be attached to a CPU instance (n1-standard-1, priced at $0.0475/hour).
Paperspace competes with GCE on the low-cost front, offering a dedicated Quadro M4000 at $0.40/hour and a Tesla V100 at $2.30/hour. Beyond the usual hourly fees, it also charges a monthly fee ($5/month) covering services such as storage and maintenance. Paperspace bills by the millisecond, with additional services available at extra cost. Hetzner offers only one dedicated server with a GTX 1080, billed monthly plus a one-time setup fee.
IBM Softlayer is one of the few platforms on the market offering bare-metal GPU servers on both monthly and hourly plans. It offers three GPU servers (including Tesla M60s and K80s) starting at $2.80/hour. The servers have static configurations, meaning less room for customization than other cloud providers offer. Softlayer also bills in full-hour increments, which can make short-running tasks disproportionately expensive: a 10-minute job billed as a full hour costs six times its actual usage.
LeaderGPU is a relatively new player, offering dedicated servers with a variety of GPUs (P100s, V100s, GTX 1080s, GTX 1080 Tis). Users can choose per-minute or hourly pricing, billed by the second. The servers have between 2 and 8 GPUs, with prices ranging from €0.02 to €0.08 per minute (€0.02/minute is about €1.20/hour).
Spot / preemptible instances
Some platforms offer deep discounts (50%-90%) on their spare computing capacity (AWS spot instances and GCE preemptible instances), with the caveat that these instances can terminate unexpectedly at any time. This makes training times highly unpredictable, since there is no guarantee the instance will ever start up again. That is fine for fault-tolerant applications with plenty of slack, but time-constrained projects fare poorly here, especially once you account for wasted engineer hours.
Running a task on a preemptible instance requires extra code to gracefully handle termination and restart (checkpointing and saving data to a persistent disk, etc.). In addition, price volatility means the cost depends heavily on capacity supply and demand while the benchmark runs, so multiple runs would be needed to average the cost out. Given the limited time spent on benchmarking, Manne did not benchmark spot/preemptible instances.
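As an illustration, here is a minimal sketch of that checkpoint-and-resume pattern using standard Keras callbacks; the checkpoint path and the stand-in model are hypothetical, not the benchmark's code.

```python
# Sketch of graceful-restart handling for spot/preemptible instances:
# checkpoint to a persistent disk every epoch, resume if a checkpoint exists.
import os
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

CKPT = '/mnt/persistent/model.h5'  # assumed persistent-disk mount point

def build_model():
    # Stand-in model; the benchmark's bidirectional LSTM would go here.
    m = Sequential([Dense(1, activation='sigmoid', input_dim=128)])
    m.compile(loss='binary_crossentropy', optimizer='adam')
    return m

# Resume from the last checkpoint if a previous (interrupted) run left one.
model = load_model(CKPT) if os.path.exists(CKPT) else build_model()

# Save after every epoch, so an unexpected termination loses at most one epoch.
ckpt_cb = ModelCheckpoint(CKPT, save_best_only=False)
# model.fit(X, y, epochs=4, callbacks=[ckpt_cb])
```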
Takeaways
Paperspace seems to be a step ahead on cost and performance, especially for those who want to experiment with deep learning techniques; another benchmark reached a similar conclusion.
Dedicated servers (such as those offered by LeaderGPU) and bare-metal servers (such as Hetzner's) are suited to users who expect to use these resources over the long term (obviously). But note the limited flexibility in customizing such servers: make sure your workload keeps the CPU/GPU busy enough to really get value for money.
Newer players like Paperspace and LeaderGPU should not be dismissed, because they can cut a large share of the cost. Businesses may hesitate to switch providers due to inertia and switching costs, but these smaller platforms are definitely worth considering.
AWS and GCE are great choices for users who want to integrate their machine learning with other services (e.g., Amazon's Rekognition or Google's Cloud AI).
Unless you need the task completed within days, sticking with a low-end single-GPU instance is the best option.
Higher-end GPUs run faster, but their return on investment is actually worse; choose them only when shorter training times (i.e., faster development cycles) matter more than hardware cost.