Which deep learning framework is strongest: TensorFlow, Caffe, MXNet, Keras, or PyTorch?

Source: Internet
Author: User

TensorFlow, Caffe, MXNet, Keras, PyTorch: which deep learning framework is strongest? Readers are naturally curious about how these frameworks differ in performance across various deep learning tasks.

The latest test results from Microsoft data scientist Ilia Karmanov show that Amazon-backed MXNet performs strongly on CNN, RNN, and NLP sentiment analysis tasks, while TensorFlow stands out only in feature extraction.

The test details are kept up to date in Ilia Karmanov's GitHub project DeepLearningFrameworks (https://github.com/ilkarman/DeepLearningFrameworks). However, the author notes that the code in the project is not written specifically for speed; it is meant only as a simple comparison of the frameworks.

The details of the project follow. Enjoy!
  
Translation | Liu
  
Edit | Donna
  
We put this comparison together for fun, so it omits many important aspects. For example: help and support, custom layers (can I create a capsule network?), data loaders, debugging, support for different platforms, distributed training, and so on.

We are not sure whether any recommendation can be made about the overall performance of the frameworks, because this project is mainly about showing how to create the same neural networks in different frameworks.

For example, you can use Caffe2 to create a CNN in Python and then replicate the network in Julia with Knet, or create an RNN in PyTorch and replicate it in TensorFlow. You can perform feature extraction in Chainer and then replicate the operation in CNTK.

The notebooks are run on a Microsoft Azure Deep Learning Virtual Machine (NC6), where the frameworks have been updated to their latest versions. This machine uses only half of an NVIDIA K80 GPU.
  
Test goals

Create a Rosetta Stone of deep learning frameworks (translator's note: Rosetta Stone is also the name of a popular language-learning program) that lets data scientists easily transfer their expertise from one framework to another (by translating rather than learning from scratch). In addition, make it more transparent to compare model training times under default options.

Many online tutorials use very low-level APIs. Although these are very detailed, they make little sense for most use cases, since higher-level helpers are usually available. Here, we try to use the highest-level API possible, overriding defaults where they conflict, to make comparison between frameworks easier.

The results below show that once a higher-level API is used, the code structure becomes very similar across frameworks and can be roughly summarized as:
  
Load the data: x_train, x_test, y_train, y_test = cifar_for_library(channel_first=?, one_hot=?)

Define the CNN/RNN network structure (usually with no activation on the final layer)

Specify the loss function (cross-entropy, which usually comes bundled with softmax) and the optimizer, and initialize the network weights and session

Train on mini-batches of the training set using a custom iterator (all frameworks share a common data source)

Predict on mini-batches of the test set

Calculate the accuracy
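
The mini-batch steps in the outline above can be sketched in plain Python. This is only a minimal illustration: the name iterate_minibatches and its parameters are assumptions made for this example, not the project's actual helper, and real code would normally work on NumPy arrays rather than lists.

```python
import random

def iterate_minibatches(x, y, batch_size=64, shuffle=False, seed=0):
    """Yield (x_batch, y_batch) pairs; an incomplete final batch is dropped."""
    assert len(x) == len(y)
    order = list(range(len(x)))
    if shuffle:
        random.Random(seed).shuffle(order)   # shuffle individual observations
    for start in range(0, len(order) - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield [x[i] for i in idx], [y[i] for i in idx]

# Toy data: 10 samples with one feature each and a binary label.
x_train = [[float(i)] for i in range(10)]
y_train = [i % 2 for i in range(10)]
batches = list(iterate_minibatches(x_train, y_train, batch_size=4))
print(len(batches))  # 2 full batches of 4; the last 2 samples are dropped
```

The same generator can feed both the training loop and the prediction loop, which is what keeps the per-framework notebooks structurally alike.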
  
In essence, we are comparing a series of deterministic mathematical operations (albeit with random initialization), so comparing accuracy across frameworks is not meaningful in itself. Instead, accuracy is reported as a check that should match across frameworks, to ensure that we are comparing the same model architecture.
  
Test results (November 24, 2017)
  
Training a CNN (VGG-style) on the CIFAR-10 dataset

Performance comparison: image recognition

The model's input is the standard CIFAR-10 dataset, containing 50,000 training images and 10,000 test images, evenly distributed among 10 classes. Each 32x32-pixel image is converted to a tensor of shape (3, 32, 32), and the pixel values are rescaled from 0-255 to 0-1. For example, an automobile image has the label vector y = (0,1,0,0,0,0,0,0,0,0), where the label names are [airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck].
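
The label encoding and pixel scaling described above can be illustrated with a short sketch. The helper names one_hot and normalize are assumptions for this example; the class names follow the official CIFAR-10 ordering.

```python
LABEL_NAMES = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

def one_hot(label, num_classes=10):
    """Return a list with 1 at the label's index and 0 elsewhere."""
    return [1 if i == label else 0 for i in range(num_classes)]

def normalize(pixels):
    """Scale raw 0-255 pixel values into the range 0-1."""
    return [p / 255.0 for p in pixels]

y = one_hot(LABEL_NAMES.index("automobile"))
print(y)                        # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(normalize([0, 51, 255]))  # [0.0, 0.2, 1.0]
```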
  
Training an RNN (GRU, gated recurrent unit) on the IMDB dataset

Performance comparison: natural language processing (sentiment analysis)

The input to this model is the standard IMDB movie review dataset, containing 25,000 training reviews and 25,000 test reviews, evenly split into 2 classes (positive/negative). The downloaded reviews are already tensors of word indices; for example, (If you like adult comedy cartoons, like South Park) is represented as (1 2 3 4 5 6 3 7 8).

We follow the Keras approach, in which the start character is set to 1 and out-of-vocabulary words (a vocabulary size of 30,000 is used) are represented as 2, so the word index starts at 3. Each review is fixed at 150 words by zero-padding or truncation.
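
A minimal sketch of this encoding, under stated assumptions: encode_review is a hypothetical helper, the input is assumed to be word indices that already start at 3, and padding and truncation are applied at the end of the sequence for simplicity (Keras's own pad_sequences pads and truncates at the front by default).

```python
PAD, START, OOV = 0, 1, 2   # reserved indices; real words start at 3

def encode_review(word_ids, vocab_size=30000, max_len=150):
    """Prepend START, map rare words to OOV, then pad/truncate to max_len."""
    tokens = [START] + [w if w < vocab_size else OOV for w in word_ids]
    tokens = tokens[:max_len]                         # truncate long reviews
    return tokens + [PAD] * (max_len - len(tokens))   # zero-pad short ones

encoded = encode_review([3, 4, 45000, 5])
print(encoded[:6])   # [1, 3, 4, 2, 5, 0]
print(len(encoded))  # 150
```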
  
Where possible, I try to use the cuDNN-optimized RNN (controlled by the CUDNN=True switch), since we have a simple RNN that can easily be reduced to the cuDNN level. For example, with CNTK we use optimized_rnnstack instead of Recurrence(LSTM()). This is much faster, although less flexible.

With CNTK at the cuDNN level, for example, we can no longer use more complex variants such as layer normalization. In PyTorch, the cuDNN path appears to be enabled by default. For MXNet, I could not find an equivalent RNN function and instead use the slightly slower fused RNN.

Keras has recently gained cuDNN support, but only with the TensorFlow backend (not the CNTK backend). TensorFlow has many RNN variants, including its own custom kernels. A good benchmark of these is available, and I will try to update the example to use CudnnLSTM instead of the current method.

Note: CNTK supports dynamic axes, which means we would not need to pad the input to 150 words. However, because I could not find a way to do this in the other frameworks, I still use padding. This is a little unfair to CNTK, because it understates its capabilities.

The classification model creates an embedding matrix of size (150x125), then applies 100 gated recurrent units and takes the final output as its result (not the output sequence, nor the hidden state).
  
ResNet-50 (feature extraction) inference performance comparison

A pre-trained ResNet-50 model is loaded and cut off just after the final avg_pooling layer (7, 7), which outputs a 2048-dimensional vector. This vector can be fed into a softmax layer or another classifier, such as a boosted tree, to perform transfer learning. Here, the forward pass up to the avg_pool layer is timed on both the CPU and the GPU.
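
To show how a (7, 7) average pooling turns the last convolutional activation into a 2048-dimensional feature vector, here is a toy sketch in plain Python. The function global_avg_pool and the constant-filled activation are assumptions for illustration; no real ResNet-50 weights are involved.

```python
def global_avg_pool(feature_maps):
    """Average each channel's HxW grid, giving one value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

# Toy stand-in for a (2048, 7, 7) activation:
# channel c is filled with the constant value c.
activation = [[[float(c)] * 7 for _ in range(7)] for c in range(2048)]
features = global_avg_pool(activation)
print(len(features))   # 2048
print(features[:3])    # [0.0, 1.0, 2.0]
```

The resulting list plays the role of the feature vector that a downstream classifier would consume.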
  
What did I learn from it?
  
About CNN
  
Here are some insights I gained while trying to match test accuracy across frameworks, and from the issues raised on GitHub.

1. The examples above (except Keras) try to use the same level of API for ease of comparison, so they all use the same generator function. For MXNet and CNTK, I also experimented with a higher-level API, using the framework's own training generator function. The speed improvement is negligible in this case, because the entire dataset is loaded into RAM as a NumPy array and the only processing done in each epoch is a shuffle. I suspect that the frameworks' generators perform the shuffle asynchronously.

Oddly, the frameworks appear to shuffle at the batch level rather than at the level of individual observations, which slightly reduces test accuracy (at least after 10 epochs). In scenarios with I/O activity and possibly on-the-fly preprocessing and data augmentation, custom generators would have a much bigger effect on performance.
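
The difference between the two levels of shuffling can be made concrete with a small sketch. Both functions are hypothetical illustrations of the distinction and do not reproduce any framework's actual data loader.

```python
import random

def shuffle_observations(data, batch_size, rng):
    """Shuffle individual samples, then cut into batches."""
    items = list(data)
    rng.shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def shuffle_batches(data, batch_size, rng):
    """Cut into fixed batches first, then shuffle only the batch order."""
    batches = [list(data[i:i + batch_size])
               for i in range(0, len(data), batch_size)]
    rng.shuffle(batches)
    return batches

data = list(range(8))
rng = random.Random(0)
by_batch = shuffle_batches(data, 4, rng)
by_obs = shuffle_observations(data, 4, rng)
# Batch-level shuffling keeps the same samples together in every epoch.
print(sorted(map(sorted, by_batch)))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With batch-level shuffling the composition of each batch never changes between epochs, which is one plausible reason for the slightly lower accuracy.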
  
2. Enabling cuDNN's auto-tune/exhaustive search parameter (which selects the most efficient CNN algorithm for images of a fixed size) brings a huge performance boost. It has to be enabled manually in Chainer, Caffe2, PyTorch, and Theano. CNTK, MXNet, and TensorFlow appear to have it enabled by default.

Yangqing Jia mentions that the performance gain between cudnnGet (the default) and cudnnFind is much smaller on the Titan X GPU.

It now seems that the combination of the K80 and the newer cuDNN makes the difference more pronounced. Running cudnnFind for every combination of image sizes in object detection causes a serious performance regression, however, so exhaustive search should not be used for object detection tasks.
  
3. When using Keras, it is important to choose the [NCHW] ordering that matches the backend framework. CNTK operates channels-first, and I had mistakenly configured Keras to expect channels-last. The order then had to be changed for every batch, which degraded performance severely. In general, [NHWC] is the default for most frameworks (such as TensorFlow), while [NCHW] is the best format to use when training with cuDNN on NVIDIA GPUs.
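
The reordering that a mismatched configuration forces on every batch looks roughly like the following. This is a plain-Python sketch on nested lists; nhwc_to_nchw is a name chosen for this example, and real code would use an array transpose instead.

```python
def nhwc_to_nchw(batch):
    """Reorder a nested list from [N][H][W][C] to [N][C][H][W]."""
    return [
        [[[pixel[c] for pixel in row] for row in image]
         for c in range(len(image[0][0]))]
        for image in batch
    ]

# One 2x2 RGB image in channels-last layout.
nhwc = [[[[1, 2, 3], [4, 5, 6]],
         [[7, 8, 9], [10, 11, 12]]]]
nchw = nhwc_to_nchw(nhwc)
print(nchw[0][0])  # first (red) channel: [[1, 4], [7, 10]]
```

Every element is touched once per batch, which is wasted work when the data could have been stored in the backend's native order from the start.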
  
4. TensorFlow, PyTorch, Caffe2, and Theano require a Boolean to be supplied to the dropout layer indicating whether we are training. This has a significant impact on test-set accuracy (77% when dropout is correctly switched off at test time, noticeably lower otherwise), so dropout should not be active during testing.
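
The role of the training flag can be sketched as follows. This assumes the common "inverted dropout" formulation, in which kept activations are rescaled during training; individual frameworks may implement the details differently, and the function below is illustrative only.

```python
import random

def dropout(values, rate=0.5, training=False, rng=random):
    """Inverted dropout: active only when training is True."""
    if not training:
        return list(values)          # inference: pass activations through
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, training=False))  # [1.0, 2.0, 3.0, 4.0]
noisy = dropout(activations, training=True, rng=random.Random(0))
print(noisy)  # each entry is either 0.0 or the input doubled
```

Forgetting to pass training=False at test time would randomly zero activations during prediction, which is how accuracy ends up lower than it should be.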
  
5. TensorFlow required two further changes: enabling TF_ENABLE_WINOGRAD_NONFUSED improved speed considerably, as did supplying the dimensions channels-first rather than channels-last (data_format='channels_first'). Enabling Winograd for convolutions naturally also improved Keras with TensorFlow as the backend.
  
6. Softmax is usually bundled with the cross_entropy_loss() function in most frameworks, so it is worth checking whether you need an activation on the final fully connected layer, to avoid applying it twice.
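
The effect of applying softmax twice can be seen in a small numeric sketch. The function names are assumptions for this example; cross_entropy_with_logits stands in for a framework loss that applies softmax internally.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_logits(logits, label):
    """Cross-entropy that applies softmax internally (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

logits = [2.0, 1.0, 0.1]
correct = cross_entropy_with_logits(logits, 0)
# Mistake: the network already applied softmax, and the loss applies it again.
double = cross_entropy_with_logits(softmax(logits), 0)
print(correct < double)  # True: the second softmax flattens the distribution
```

Besides wasting computation, the double application changes the loss values and gradients, so the final layer should emit raw logits when the loss already includes softmax.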
  
7. The kernel initializers of different frameworks can vary, and I have found this to have a ±1% effect on accuracy. I try to specify Xavier/Glorot uniform wherever possible, as long as doing so is not too verbose.
  
8. The type of momentum implemented for SGD with momentum differs: I had to turn off unit_gain, which is on by default in CNTK, to match the implementations in the other frameworks.
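
A sketch of the two momentum conventions, assuming the usual reading of unit gain in which the gradient is scaled by (1 - momentum) before it is added to the velocity. The function momentum_step is written for this example and is not any framework's API.

```python
def momentum_step(w, v, grad, lr=0.01, momentum=0.9, unit_gain=False):
    """One SGD-with-momentum update; returns the new (weight, velocity)."""
    scale = (1.0 - momentum) if unit_gain else 1.0
    v = momentum * v + scale * grad
    return w - lr * v, v

w_plain, v_plain = momentum_step(1.0, 0.0, grad=2.0, unit_gain=False)
w_unit, v_unit = momentum_step(1.0, 0.0, grad=2.0, unit_gain=True)
print(v_plain, v_unit)  # the unit-gain velocity is 10x smaller at momentum 0.9
```

With the same learning rate, the unit-gain variant takes much smaller steps, so leaving the setting mismatched makes training curves incomparable across frameworks.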
  
9. Caffe2 has an extra optimization for the first layer of a network (no_gradient_to_input=1) that gives a small speed boost by not computing gradients for the input. TensorFlow and MXNet may already enable this by default. Computing this gradient can be useful for research purposes and for networks such as deep-dream.
  
10. Applying the ReLU activation after max pooling (instead of before) means the calculation is performed after the dimensionality reduction, which shaves off a few seconds. This reduced the MXNet running time by 3 seconds.
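
Swapping the order is safe because ReLU is monotonic, so it commutes with max pooling. The following one-dimensional sketch, with helper names chosen for this example, checks that both orders give the same result while the second evaluates ReLU on fewer values.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def max_pool_1d(values, size=2):
    """Non-overlapping max pooling over a flat list."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

x = [-3.0, 1.0, -2.0, -0.5, 4.0, 2.0, 0.0, -1.0]
relu_then_pool = max_pool_1d(relu(x))   # ReLU evaluated on 8 values
pool_then_relu = relu(max_pool_1d(x))   # ReLU evaluated on only 4 values
print(relu_then_pool == pool_then_relu)  # True
print(pool_then_relu)                    # [1.0, 0.0, 4.0, 0.0]
```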
  
11. Some additional checks that may be useful:

Does specifying the kernel as (3) expand to a symmetric tuple (3, 3) or to a 1D convolution (3, 1)?

Do strides (for max pooling) default to a fixed value or to the kernel size (as in Keras)?

Default padding is usually off (0, 0) or 'valid', but it is worth checking that it is not on/'same'.

Is the default activation on a convolutional layer 'None' or 'ReLU'?

The bias initializer may vary (sometimes no bias is included at all).

Gradient clipping and the treatment of infinite values or NaNs may differ between frameworks.

Some frameworks support sparse labels instead of one-hot encoded ones (for example, with TensorFlow I use tf.nn.sparse_softmax_cross_entropy_with_logits).

Data type assumptions may differ. For example, I try to use float32 and int32 for X and y. However, in Torch, y needs to be of type double (so that it can be coerced with torch.LongTensor(y).cuda()).

If the framework has a slightly lower-level API, make sure that gradients are not computed during testing, by setting something like training=False.
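
Two of the checks above, stride defaults and padding mode, change the shape of every layer's output. The sketch below uses the common 'valid' and 'same' size formulas; conv_output_size is a helper written for this example, and individual frameworks may round differently in edge cases.

```python
import math

def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1   # "valid": no padding

print(conv_output_size(32, 3, padding="valid"))  # 30
print(conv_output_size(32, 3, padding="same"))   # 32
print(conv_output_size(32, 2, stride=1))         # 31 (pooling with stride 1)
print(conv_output_size(32, 2, stride=2))         # 16 (stride equal to kernel)
```

A silent difference in either default is enough to make two supposedly identical networks have different parameter counts and running times.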
  
12. I have heard that installing Caffe2 for Python 3.5 can be a little difficult, so I am sharing a script here.
  
About RNN
  
1. Most frameworks (such as TensorFlow) offer multiple RNN implementations/kernels; once reduced to the cuDNN LSTM/GRU level, execution is fastest. However, this implementation is less flexible (for example, you may want layer normalization), and it can become a problem if inference is later run on the CPU.

2. At the cuDNN level, the running times of most frameworks are very similar. An NVIDIA blog post describes several interesting cuDNN optimizations for recurrent neural networks, such as fusing: "combining the computation of many small matrices into that of larger ones and streaming the computation whenever possible, the ratio of computation to memory I/O can be increased, which results in better performance on the GPU."
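
The idea behind fusing can be illustrated with a toy example: several small gate matrices applied to the same input are stacked into one larger matrix and multiplied once. The matrices and names below are made up for illustration, and the sketch says nothing about how cuDNN implements this internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Three small gate matrices (as in a GRU), each 2x3, applied to the same input.
w_reset = [[1, 0, 2], [0, 1, 0]]
w_update = [[0, 3, 1], [1, 1, 1]]
w_cand = [[2, 2, 0], [0, 0, 1]]
x = [1, 2, 3]

separate = matvec(w_reset, x) + matvec(w_update, x) + matvec(w_cand, x)
fused = matvec(w_reset + w_update + w_cand, x)   # one 6x3 product
print(fused == separate)  # True
print(fused)              # [7, 2, 9, 6, 6, 3]
```

The arithmetic is identical, but on a GPU one large product launches fewer kernels and reuses the input from memory, which is where the speedup comes from.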
