Comparing the Caffe, TensorFlow, and MXNet open-source libraries
Google recently open-sourced its internal deep learning framework TensorFlow [1]. This article discusses it alongside the open-source MXNet [2] and Caffe [3]. Of the three, I have only read the Caffe source code carefully; for the other two I have read only the official documentation and some comments from researchers. The article first compares the three libraries as a whole, and then discusses in detail the different data structures, computation methods, and GPU selection mechanisms that each of them designed. Table 1 records and compares some basic facts about the three. The Examples column indicates whether the examples are easy to read and understand. TensorFlow installs directly as a Python package, so I did not dig into its source code at first, and finding examples in its documentation is less direct than reading the source of the other two. In fact, TensorFlow feels more like a set of standalone Python interfaces: it supports not only CNNs/RNNs, but I have also seen it used for k-means clustering. The table involves obvious subjective judgment and is for reference only.
(Updated June 27, 2016.) The first half of the year has passed, and looking back at this article I feel a little wistful. Caffe has not been updated for a long time, while TensorFlow has left its earlier position behind: since version 0.8 it has supported distributed training... MXNet is still working as hard as ever: four more supported languages have been added (MATLAB/JavaScript/C++/Scala), the documentation looks nicer, and a demo of recognition on mobile phones [8] has been released.
(The original article is as follows)
| Library name | Development language | Supported interfaces | Installation difficulty (Ubuntu) | Document style | Examples | Supported models | Ease of getting started |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Caffe | C++/CUDA | C++/Python/MATLAB | * | * | * | CNN | ** |
| MXNet | C++/CUDA | Python/R/Julia | ** | * | ** | CNN/RNN | * |
| TensorFlow | C++/CUDA/Python | C++/Python | * | ** | * | CNN/RNN/... | * |
- Installation difficulty: * (simple) -> ** (complex)
- Document style: * (average) -> ** (nice looking and comprehensive)
- Examples: * (few) -> ** (many and complete)
- Ease of getting started: * (easy) -> ** (difficult)
1. Basic Data Structure
| Library name | Data structure name | Design method |
| --- | --- | --- |
| Caffe | Blob | The stored data can be viewed as a contiguous n-dimensional C array with four dimensions (n, k, h, w). A Blob contains two data spaces, one for the forward data and one for the backward derivatives. |
| MXNet | NDArray | Provides CPU/GPU matrix and vector computation with automatic parallel execution. |
| TensorFlow | Tensor | Equivalent to an n-dimensional array or list; its dimensions are variable, but once defined the data type cannot be changed. |
In Caffe's data storage class Blob, the stored data can be viewed as a contiguous n-dimensional C array. For example, an image batch is stored with 4 dimensions (num, channel, height, width), and the element (n, k, h, w) sits in the array at flat offset ((n * K + k) * H + h) * W + w (a small sketch of this offset calculation follows the list below). A Blob has the following three characteristics [4]:
- It holds two pieces of data: the original data, and the derivative values (diff).
- It supports two memory allocations: one on the CPU and one on the GPU, distinguished by the cpu and gpu prefixes.
- It offers two access modes: one in which the data cannot be changed, and one in which it can.
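As a concrete illustration of the layout described above, here is a minimal Python sketch of the flat-offset calculation; the function name blob_offset is made up for illustration and is not part of Caffe's API.

```python
# Minimal sketch of Caffe's Blob memory layout: element (n, k, h, w) of an
# N x K x H x W blob lives at offset ((n * K + k) * H + h) * W + w.
def blob_offset(n, k, h, w, K, H, W):
    return ((n * K + k) * H + h) * W + w

# For a blob of shape (2, 3, 4, 5), element (1, 2, 3, 4) is the last of the
# 2 * 3 * 4 * 5 = 120 elements, at offset 119.
print(blob_offset(1, 2, 3, 4, K=3, H=4, W=5))  # 119
```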
What I find most subtle about Caffe is that a single Blob stores both the forward data and the backward derivatives. In the code itself, the forward data changes because the input changes, and the backward data changes because the forward data changes; by the single-responsibility principle it looks inappropriate for one class to change for two reasons. Logically, however, a change in the forward data is exactly what causes the backward data to differ: they change together and really form a whole, so I like this design very much. Other frameworks keep the two pieces of data separate, and it is not yet known whether caffe2 will retain this design.
MXNet's NDArray is similar to numpy.ndarray and likewise supports allocating data on the GPU or CPU for computation. Unlike numpy and Caffe, however, operations on NDArray can automatically be distributed across multiple GPUs and CPUs for high-speed parallel execution. From the caller's point of view the code looks single-threaded and the data appears to live in a single piece of memory, but execution is actually parallel: operations (addition, subtraction, and so on) are placed into an intermediate engine, which then works out which data depends on what and which operations can run in parallel. After the data is defined, it is bound to the network for processing.
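A minimal sketch of the idea, assuming an MXNet installation with at least one visible GPU; the shapes and context are arbitrary:

```python
import mxnet as mx

a = mx.nd.ones((2, 3), ctx=mx.gpu(0))        # allocated on GPU 0
b = mx.nd.ones((2, 3), ctx=mx.gpu(0)) * 2
c = a + b               # queued in the dependency engine, not executed immediately
d = c.copyto(mx.cpu())  # bring the result back to CPU memory
print(d.asnumpy())      # asnumpy() blocks until the pending computation has finished
```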
TensorFlow's Tensor is likewise equivalent to an n-dimensional array or list and, like MXNet, is exposed through Python calls. The data type of a Tensor is fixed once defined, but its dimensions can change dynamically. Tensor rank and TensorShape describe its dimensionality: a rank-2 tensor can be regarded as a matrix, and a rank-1 tensor as a vector. Tensor is otherwise a fairly ordinary type; the one notable difference is that in a network built with TensorFlow, Tensor is the only type that can be passed between operations, so plain arrays or lists cannot be fed through the graph directly.
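A small sketch, using the 2016-era TensorFlow API, of how rank and TensorShape describe a tensor:

```python
import tensorflow as tf

m = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank 2 -> can be viewed as a matrix
v = tf.constant([1.0, 2.0, 3.0])           # rank 1 -> can be viewed as a vector
print(m.get_shape())                       # static TensorShape: (2, 2)
with tf.Session() as sess:
    print(sess.run(tf.rank(m)), sess.run(tf.rank(v)))  # 2 1
```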
It is worth mentioning that the data structure used by cuda-convnet is NVMatrix. NV indicates that the data is allocated on the GPU, and all variables are treated as matrices with only two dimensions. cuda-convnet was the first deep learning framework implemented in CUDA, whereas the three frameworks above all adopt a multi-dimensional, variable-dimension design, which is very effective when implementing convolution with matrix operations.
2. Network implementation
Caffe follows a typical functional (procedural) style of computation. It classifies functionality by major purpose (visualization, loss functions, nonlinear activations, data layers), implements parent classes for some of these groups, and then implements concrete functionality in child classes, or inherits directly from the Layer class, producing the XXXLayer form. Different layers are then combined to form a net.
Figure 1: Caffe network structure (source: [7])
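For illustration, a minimal pycaffe NetSpec sketch of composing XXXLayer objects into a net; the layer names and the LMDB path are hypothetical:

```python
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
# 'train_lmdb' is a placeholder data source
n.data, n.label = L.Data(batch_size=64, source='train_lmdb',
                         backend=P.Data.LMDB, ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20)
n.relu1 = L.ReLU(n.conv1, in_place=True)
n.fc1   = L.InnerProduct(n.relu1, num_output=10)
n.loss  = L.SoftmaxWithLoss(n.fc1, n.label)
print(n.to_proto())   # the combined layers, emitted as a prototxt net definition
```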
MXNet mixes symbolic computation with procedural computation [5]. It provides a Symbol class along with many symbol operation interfaces. Each symbol defines how data is to be processed, and only defines it: this step does not actually execute anything. One thing to note is that symbols include Variable, which acts as a placeholder for carrying data; the data to be fed is declared as a Variable, and concrete data is bound to it in a later step. The following line is a usage example: it attaches an activation function to the net defined so far, giving the symbol a name and an activation type. In Figure 2, the left part defines the collection of symbols, data is bound to the Variables in the middle, and this turns into the real execution flow graph on the right.
net = mx.symbol.Activation(data=net, name='relu1', act_type="relu")
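In a slightly fuller sketch (the layer names and shapes below are made up for illustration), the symbols are declared first and concrete NDArrays are only bound afterwards:

```python
import mxnet as mx

# declaration only: no computation happens here
data = mx.symbol.Variable('data')
fc1  = mx.symbol.FullyConnected(data=data, name='fc1', num_hidden=128)
net  = mx.symbol.Activation(data=fc1, name='relu1', act_type="relu")

# bind concrete NDArrays to the Variables, then actually execute the graph
exe = net.simple_bind(ctx=mx.cpu(), data=(32, 100))
exe.forward(is_train=False)
print(exe.outputs[0].shape)   # (32, 128)
```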
Figure 2: MXNet network structure (source: [2])
TensorFlow chose the symbolic computation model. A program is divided into a construction phase and an execution phase. The construction phase builds the computation graph: a flow graph containing a series of symbolic Operations and Tensor data objects. Like MXNet's symbols, it defines how computations are to be performed (addition, subtraction, multiplication, division, and so on) and how data flows through the sequence of operations (that is, the flow of Tensors between Operations), but at that point it does not read input and compute output; only in the subsequent execution phase is a Session started and run to execute the defined graph. This approach is similar to MXNet's and presumably draws on Theano's ideas. TensorFlow also introduces a Variable type. Unlike MXNet's Variable, which is a kind of symbol (a tf Operation is roughly analogous to an MXNet symbol), it is a separate type whose main purpose is to store network weight parameters, which can be updated dynamically while the graph runs. tf abstracts every computation as an Operation symbol that reads zero or more Tensor objects as input and produces Tensor outputs. The operations include basic mathematical operations; reduce and segment operations (operations over parts of a tensor; for example, if a tensor has length 10, you can compute the sum of the first five, the middle two, and the last three elements at the same time); and image operations such as resize, pad, crop, flipping, and transposing. tf does not provide as good a graphical explanation or example as MXNet (or perhaps I simply did not find one), so I drew part of the flow graph myself according to my understanding. One slightly confusing point is why Variable was designed at all: in the AlexNet example source provided by tf, the input data and the weights are both defined as Variables, while the output of each layer is not defined explicitly. According to tf, only the Tensor type can be passed through the network, so the outputs should be Tensors; but since the outputs change as the inputs and weights change, why not design only a Tensor type and let Tensors change dynamically?
Figure 3 TensorFlow computation graph
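A minimal sketch of the two phases, using the 2016-era session API; the shapes and initial values are arbitrary:

```python
import numpy as np
import tensorflow as tf

# construction phase: build the graph, nothing runs yet
x = tf.placeholder(tf.float32, shape=(None, 784), name='x')
W = tf.Variable(tf.zeros([784, 10]), name='weights')   # weights live in Variables
b = tf.Variable(tf.zeros([10]), name='bias')
y = tf.nn.softmax(tf.matmul(x, W) + b)                  # Operations producing Tensors

# execution phase: launch a Session and run the defined graph
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())             # 0.x-era initializer
    out = sess.run(y, feed_dict={x: np.random.rand(5, 784).astype(np.float32)})
    print(out.shape)                                     # (5, 10)
```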
In terms of design, TensorFlow looks more like a general machine learning framework than the other two, rather than targeting only CNNs or RNNs; in terms of current performance, however, tf is slower than many open-source frameworks [6].
3. Distributed training
Neither Caffe nor TensorFlow provides a distributed version; only MXNet offers a multi-host distributed architecture, so for the first two the discussion is limited to controlling multiple GPUs on a single machine. Caffe uses the flag -gpu 0,1 to use GPUs 0 and 1, and it implements only data parallelism, that is, running the same network on different GPUs with different data; Caffe instantiates multiple solvers and nets so that the effective batch_size per step is doubled. In TensorFlow, you can specify the GPU on which an operation runs by writing with tf.device('/gpu:2'), which indicates that the following operations are to be placed on GPU 2; this is likewise data-parallel. MXNet runs across several hosts by specifying the number of nodes when launching the script, again with data parallelism; MXNet's allocation across multiple GPUs and the data synchronization between them are handled by its data synchronization component, KVStore.
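A small sketch of pinning operations to a device in TensorFlow; the device string is assumed to exist on the machine, and allow_soft_placement lets the graph fall back to another device if it does not:

```python
import tensorflow as tf

with tf.device('/gpu:1'):                      # place the following ops on GPU 1
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

config = tf.ConfigProto(allow_soft_placement=True)  # fall back if GPU 1 is absent
with tf.Session(config=config) as sess:
    print(sess.run(b))
```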
To use KVStore, you first create a kv space, which is used to share data between different GPUs or hosts. The most basic operations are push and pull: push puts data into the space, and pull retrieves data from it. The space stores key-value pairs ([int, NDArray]), and the key must be specified on each push/pull. The following code accumulates the copies of b[i] living on different devices under key 3 in the kv space and then outputs the result into a, thereby handling multiple GPUs. This is a very nice design: it gives the developer a great degree of freedom while sparing them the trouble of controlling the underlying data transfers.
import mxnet as mx

shape = (2, 3)
kv = mx.kv.create('local')         # create the shared key-value store
kv.init(3, mx.nd.zeros(shape))     # key 3 must be initialized before push/pull
a = mx.nd.zeros(shape)
gpus = [mx.gpu(i) for i in range(4)]
b = [mx.nd.ones(shape, gpu) for gpu in gpus]
kv.push(3, b)                      # values pushed under the same key are summed
kv.pull(3, out=a)                  # a now holds the aggregated result
I once read a paper on training convolutional networks on multiple GPUs, which describes two approaches: the usual data parallelism, and model parallelism. Model parallelism splits a complete network into blocks placed on different GPUs, so each GPU may process only a fraction of a single image, for example a quarter. Model parallelism was largely motivated by networks not fitting into a single GPU's memory; as GPUs have become more capable, a single GPU can usually hold the whole network, and model parallelism adds extra communication overhead, so the open-source frameworks all use data parallelism to increase the degree of parallelism.
4. Summary
The above compares the three frameworks from several angles. It can be seen that TensorFlow and MXNet have some similarities: both aim to be more general deep learning frameworks, and it seems that caffe2 will also adopt symbolic computation [5], which suggests future frameworks will be more generic and efficient. My personal favorite is Caffe, along with cuda-convnet, whose architecture has also influenced how convolutional networks are written; if you want to improve your programming skills, the source code of these two frameworks is worth reading. MXNet feels carefully crafted and efficient, with very detailed documentation; it is not only easy to get started with but also very flexible to use. TensorFlow is full-featured and can build richer networks than Caffe, which is largely limited to CNNs. In short, each framework has its own merits and the choice comes down to personal preference. Still, the arrival of Google's heavyweight attracts the most attention: although its single-machine performance is not yet impressive, looking at that long list of developers one can only conclude that with so many people they can afford to do as they please.