MXNet Design Notes: A comparison of programming patterns in deep learning

Tags: theano, mxnet

There is a wide variety of deep learning libraries on the market, each with its own style. What are the advantages and disadvantages of these styles in terms of system optimization and user experience? This article compares their programming patterns, discusses the fundamental trade-offs of each pattern, and draws lessons from them.

We focus primarily on the programming model itself, not on specific implementations. This article is therefore not a head-to-head comparison of deep learning libraries. Instead, we group the libraries by the kind of interface they provide and discuss how each kind of interface affects the performance and flexibility of deep learning programs. The discussion is not limited to deep learning, but we will use deep learning examples for analysis and optimization.

Symbolic programming vs imperative programming

In this section, we first compare the two forms: symbolic style programs and imperative style programs. If you are a Python or C++ programmer, you are already familiar with imperative programs: an imperative program executes operations as we issue them. Most Python code is imperative, such as the following NumPy calculation.

import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1

When the program executes the line c = b * a, the machine performs an actual multiplication. Symbolic programs are slightly different. The following code also computes the value of d, but belongs to the symbolic style.

A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# compiles the function
f = compile(D)
d = f(A=np.ones(10), B=np.ones(10) * 2)

The difference is that when the program executes the line C = B * A, no real computation happens. Instead, the program generates a computation graph (symbolic graph) that describes the whole computation. (The original article shows the computation graph of D at this point.)

Most symbolic programs explicitly or implicitly include a compilation step that converts the computation graph into a function that can be called. The real computation happens only in the last line of the code above. The most important characteristic of a symbolic program is this clear separation between defining the computation graph and compiling and running it.
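Variable, Constant, and compile in the snippet above are pseudocode rather than any real library's API. For concreteness, here is a minimal sketch of what such a symbolic API could look like in pure Python; every name in it (Node, Variable, Constant, compile) is made up for illustration, and a real library would do far more work in the compile step (optimization, code generation).

import numpy as np

class Node(object):
    """A node in the computation graph: records the op and its inputs."""
    def __init__(self, op, inputs, name=None):
        self.op, self.inputs, self.name = op, inputs, name
    def __mul__(self, other):
        return Node('mul', [self, _wrap(other)])
    def __add__(self, other):
        return Node('add', [self, _wrap(other)])

def Variable(name):
    return Node('var', [], name)

def Constant(value):
    return Node('const', [value])

def _wrap(x):
    return x if isinstance(x, Node) else Constant(x)

def compile(root):   # deliberately shadows the builtin to match the pseudocode above
    """Turn the graph into a callable; a real library would optimize the graph here."""
    def evaluate(node, feed):
        if node.op == 'var':
            return feed[node.name]
        if node.op == 'const':
            return node.inputs[0]
        left, right = [evaluate(x, feed) for x in node.inputs]
        return left * right if node.op == 'mul' else left + right
    return lambda **feed: evaluate(root, feed)

A = Variable('A')
B = Variable('B')
C = B * A               # no computation happens here, only graph building
D = C + Constant(1)
f = compile(D)          # the compilation step produces the callable
print(f(A=np.ones(10), B=np.ones(10) * 2))   # [3. 3. ... 3.]

The sketch only demonstrates the separation of phases: building D touches no data, and all real computation happens inside the function that compile returns.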

Deep learning libraries that use imperative programming include Torch, Chainer, and Minerva. Libraries that use symbolic programming include Theano and CGT. Libraries driven by configuration files, such as cxxnet and Caffe, can also be regarded as symbolic, because the contents of the configuration file define a computation graph.

Now that you understand the two programming models, let's go ahead and compare them!

More flexibility in imperative programs

Strictly speaking, this is not always true, but in most cases imperative programs are more flexible than symbolic ones. If you want to write imperative code in Python, you just write it. Writing the equivalent symbolic program is a different matter. Look at the following imperative program and think about how you would turn it into a symbolic program.

a = 2
b = a + 1
d = np.zeros(10)
for i in range(len(d)):
    d += np.zeros(10)

You will find this is not so easy, because a Python for loop may not be supported by the symbolic API. When you write symbolic code in Python, you are not really writing Python: you are writing a domain-specific language (DSL) defined by the symbolic API. The symbolic APIs in deep learning libraries are powerful DSLs that generate computation graphs, i.e., configurations of neural networks. In that sense, configuration-file-based libraries are all symbolic.

Because imperative programs are more native to the host language than symbolic ones, it is easier to use the features of the language itself and to interleave them with the computation: for example, printing intermediate values during the computation, or using the host language's conditionals and loops.

Symbolic programs are more efficient

As we argued in the previous section, imperative programs are more flexible and more native to the host language. Why, then, do most deep learning libraries choose the symbolic style? The main reason is efficiency, both in memory usage and in computation time. Consider the small example from the beginning of this article.

import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
...

Assume that each array element occupies 8 bytes. How much memory do we need to execute this program in the Python console? We have to store four arrays of 10 elements each, which requires 4 * 10 * 8 = 320 bytes. If instead we run the computation graph, we can share the memory of c and d, so 3 * 10 * 8 = 240 bytes are enough.

Symbolic programs are more restricted. When we compile D, we tell the system that only the value of D is needed. The intermediate result, the value of C, is invisible to the user. This allows the symbolic program to safely reuse C's memory and compute D in place (in-place computation).
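To make the memory argument concrete, here is a small sketch, using plain NumPy, of what an executor that knows only D is requested could do; the explicit scratch buffer is an illustration of in-place computation, not how any particular library implements it.

import numpy as np

a = np.ones(10)
b = np.ones(10) * 2

# An executor that knows only D is wanted can write C into a scratch buffer
# and then overwrite that same buffer with D (in-place computation):
scratch = np.empty(10)
np.multiply(b, a, out=scratch)    # scratch now holds C = B * A
np.add(scratch, 1, out=scratch)   # scratch now holds D = C + 1; C's memory was reused
d = scratch
# Total: 3 arrays of 10 doubles = 240 bytes, versus 4 arrays (320 bytes) when
# c and d must both stay alive.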

Imperative programs, on the other hand, have to be prepared for everything. If the program above is run in a Python console, any of the variables may still be used later, so the system cannot share their memory.

Of course, this claim is somewhat idealized: garbage collection kicks in when variables go out of scope in imperative programs, and that memory does get reused. However, our ability to optimize is limited by this "be prepared for everything" property. Gradient calculation is a typical example, which we discuss in the next section.

Another optimization a symbolic program can perform is operation folding. In the code above, the multiplication and the addition can be folded into a single operation. With GPU computing, this means launching only one GPU kernel instead of two. This is exactly what the hand-crafted operations in optimized libraries such as cxxnet and Caffe do, and it improves computational efficiency.
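The following sketch illustrates what folding the multiplication and addition into one operation means: one pass over the data with no intermediate array, analogous to launching a single GPU kernel. The fused_mul_add function is purely illustrative.

import numpy as np

a = np.ones(10)
b = np.ones(10) * 2

# Unfused: two passes over the data and one temporary array for c.
c = b * a
d = c + 1

# Folded: a single fused multiply-add does the work in one pass over the data.
# On a GPU this corresponds to launching one kernel instead of two.
def fused_mul_add(x, y, const, out):
    for i in range(len(out)):      # one loop body = one "kernel"
        out[i] = x[i] * y[i] + const
    return out

d_fused = fused_mul_add(b, a, 1.0, np.empty(10))
assert np.allclose(d, d_fused)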

We cannot do this in an imperative program, because the intermediate results may be referenced somewhere in the future. The optimization is possible in a symbolic program because we have the complete computation graph and a clear boundary between the values that are needed and those that are not. An imperative program only performs local operations and has no such boundary.

Case study: backprop and autodiff

In this section, we compare the two programming patterns on the problem of automatic differentiation, or backpropagation. Gradient calculation is a problem nearly every deep learning library has to solve, and it can be implemented in both the imperative and the symbolic style.

Let's look at the imperative version first. The following code implements automatic differentiation for the example we discussed earlier.

class array(object):
    """Simple array object that supports autodiff."""
    def __init__(self, value, name=None):
        self.value = value
        if name:
            self.grad = lambda g: {name: g}

    def __add__(self, other):
        assert isinstance(other, int)
        ret = array(self.value + other)
        ret.grad = lambda g: self.grad(g)
        return ret

    def __mul__(self, other):
        assert isinstance(other, array)
        ret = array(self.value * other.value)
        def grad(g):
            x = self.grad(g * other.value)
            x.update(other.grad(g * self.value))
            return x
        ret.grad = grad
        return ret

# some examples
a = array(1, 'a')
b = array(2, 'b')
c = b * a
d = c + 1
print(d.value)
print(d.grad(1))
# results
# 3
# {'a': 2, 'b': 1}

In the code above, each array object contains a grad function (in fact, a closure). When we execute d.grad, it recursively invokes the grad functions, propagates the gradient back, and returns the gradient value of each input. This may look a bit complicated, so let's look at the symbolic version of gradient calculation. The following code computes the gradients symbolically.

A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# get gradient node.
gA, gB = D.grad(wrt=[A, B])
# compiles the gradient function.
f = compile([gA, gB])
grad_a, grad_b = f(A=np.ones(10), B=np.ones(10) * 2)

D's grad function generates a backward computation graph and returns the gradient nodes gA and gB. (In the original article's figure, these correspond to the red nodes.)

The imperative program actually does the same thing as the symbolic one: it implicitly stores a backward computation graph in the grad closures. When d.grad is executed, we start from d, trace back through the graph to compute the gradients, and collect the results.

So we find that symbolic and imperative programs compute gradients in essentially the same way. Where, then, do they differ? Recall that imperative programs have to "be prepared for everything". If we build an array library that supports automatic differentiation, we have to keep the grad closures along the way. This means none of the history variables can be garbage collected, because they are all referenced, via the closures, by the variable d. But what if we only want the value of d and do not care about the gradients?

In the symbolic program, we would instead declare f = compile([D]). This also declares the boundary of the computation, telling the system that only the forward path needs to be computed. The system is then free to release the memory of intermediate results and to share memory between inputs and outputs.

Now suppose we are not running this toy example but an n-layer deep neural network. If we only run the forward path and never the backward (gradient) path, we need to allocate only two temporary buffers to hold the intermediate layer results, instead of n. Because imperative programs must stay prepared for gradients that might be requested later, they have to keep all the intermediate results, using n temporary buffers.
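A sketch of the two-buffer idea, assuming for simplicity an n-layer fully connected network in which every layer has the same width; the two ping-pong buffers stand in for what a forward-only graph executor could allocate.

import numpy as np

n_layers, width = 8, 10
np.random.seed(0)
weights = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]

def forward_only(x):
    # Two ping-pong buffers are enough for a forward-only pass: layer i reads
    # one buffer and writes the other, because no backward pass will ever ask
    # for the activations that get overwritten.
    buf_a, buf_b = x.copy(), np.empty_like(x)
    for w in weights:
        np.dot(buf_a, w, out=buf_b)   # write the pre-activation into the spare buffer
        np.tanh(buf_b, out=buf_b)     # apply the nonlinearity in place
        buf_a, buf_b = buf_b, buf_a   # swap: the old activation is now free to reuse
    return buf_a

out = forward_only(np.ones(width))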

As we can see, how much can be optimized depends on how much the user's behavior is constrained. The idea behind symbolic programs is to have the user specify the boundary of the computation explicitly through compilation, while imperative programs must prepare for everything that might follow. The symbolic program naturally benefits from knowing more about what the user does and does not need.

Of course, we can also impose constraints on imperative programs. For example, one solution to the problem above is to introduce a context variable: a no-gradient context lets us skip gradient bookkeeping. This puts more constraints on the imperative program in exchange for performance.

with context.NoGradient():
    a = array(1, 'a')
    b = array(2, 'b')
    c = b * a
    d = c + 1

However, the example above still has many possible futures; in particular, it cannot reuse memory by doing in-place computation along the forward path (a common trick for reducing GPU memory usage). The techniques discussed in this section produce an explicit backward path. In toolkits such as Caffe and cxxnet, backpropagation is done implicitly within the same computation graph, and the discussion in this section applies to those cases as well.

Most configuration-file-based libraries, such as cxxnet and Caffe, are designed to meet one or two generic requirements: compute the activations of each layer, or compute the gradients of all the weights. These libraries face the same problem: the more generic operations a library has to support, the fewer optimizations (such as memory sharing) we can make, assuming everything is built on the same data structure.

So, once again, we see a trade-off between restriction and flexibility.

Model checkpoints

The ability to save a model and load it back later is important to most users. There are different ways to save your work. Normally, saving a neural network means storing two things: the configuration of the network structure and the values of its weights.

Being able to checkpoint the configuration is a plus for symbolic programs. Because the symbolic construction phase does not involve any computation, we can directly serialize the computation graph and load it back later, without introducing an extra layer just for saving the configuration.

A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
D.save('mygraph')
...
D2 = load('mygraph')
f = compile([D2])
# more operations
...

Because an imperative program executes line by line, we would have to store the whole block of code as the configuration, or build an additional configuration layer on top of the imperative language.

Parameter update

Most symbolic programs are data flow (computation) graphs. Data flow graphs describe computation conveniently, but they are not convenient for describing parameter updates, because updates cause mutation, which is not part of the data flow concept. Most symbolic libraries therefore introduce a special update statement to update the persistent state of the program.

It is usually easier to write parameter updates in an imperative style, especially when you need several updates that refer to each other. For symbolic programs, the update statement is likewise called and executed by us. So in a sense, most existing symbolic deep learning libraries fall back to the imperative approach for the updates, while using the symbolic approach for the gradient calculation.
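As a sketch of the difference: an imperative SGD step is just an in-place mutation, while a dataflow graph needs a dedicated update construct. The code below is illustrative only, and update is a hypothetical statement, not any specific library's API.

import numpy as np

# Imperative update: a plain in-place mutation, trivial to write and to customize.
def sgd_step(weights, grads, lr=0.01):
    for w, g in zip(weights, grads):
        w -= lr * g                      # mutation: not a node in a dataflow graph

weights = [np.random.randn(3, 3), np.random.randn(3)]
grads = [np.ones((3, 3)), np.ones(3)]
sgd_step(weights, grads)

# Symbolic style (pseudocode): the library needs a special construct, e.g.
#   new_w = w - lr * grad_w              # ordinary dataflow
#   step  = update(w, new_w)             # hypothetical statement that mutates state w
# and 'step' is then executed by the user on every iteration.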

There is no strict boundary

We have compared the two programming styles. Some of the claims made may not be strictly accurate, and there is no clear boundary between the two styles. For example, we can compile an imperative Python program with a JIT compiler, which gives us some of the global-information advantages of symbolic programs. Nevertheless, most of the arguments above hold, and these constraints apply when we develop deep learning libraries.

Big operations vs small operations

Having crossed the battleground of symbolic and imperative programs, let's now talk about the operations supported by deep learning libraries. Deep learning libraries typically support two kinds of operations.

    • Big layer operations, such as FullyConnected and BatchNorm.
    • Small operations, such as element-wise addition and multiplication.

Libraries such as cxxnet and Caffe support layer-level operations, while libraries such as Theano and Minerva support fine-grained operations.

More flexibility for smaller operations

This is quite obvious, because we can always combine fine-grained operations to build bigger ones. For example, the sigmoid function can easily be composed from division and exponentiation.

sigmoid(x) = 1.0 / (1.0 + exp(-x))

Using small operations as building blocks, we can express almost any problem we want. For readers more familiar with cxxnet and Caffe: these operations are no different from layer-level operations, except that their granularity is finer.

SigmoidLayer(x) = EWiseDivisionLayer(1.0, AddScalarLayer(ExpLayer(-x), 1.0))

So the expression above becomes a composition of three layers, each defining its own forward and backward (gradient) function. Small operations make it easy to build new layers, because we only need to compose existing pieces.
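Here is a small illustrative sketch of that composition in NumPy: three fine-grained layers, each with its own forward and backward function, chained to form a sigmoid. The class names follow the pseudocode above but are not taken from cxxnet or Caffe.

import numpy as np

class ExpLayer(object):
    def forward(self, x):
        self.y = np.exp(x)
        return self.y
    def backward(self, grad_y):              # d exp(x)/dx = exp(x)
        return grad_y * self.y

class AddScalarLayer(object):
    def __init__(self, c):
        self.c = c
    def forward(self, x):
        return x + self.c
    def backward(self, grad_y):              # adding a constant passes the gradient through
        return grad_y

class EWiseDivisionLayer(object):            # computes c / x for a scalar numerator c
    def __init__(self, c):
        self.c = c
    def forward(self, x):
        self.x = x
        return self.c / x
    def backward(self, grad_y):              # d(c/x)/dx = -c / x^2
        return -grad_y * self.c / (self.x ** 2)

# SigmoidLayer(x) = EWiseDivisionLayer(1.0, AddScalarLayer(ExpLayer(-x), 1.0))
layers = [ExpLayer(), AddScalarLayer(1.0), EWiseDivisionLayer(1.0)]
x = np.linspace(-2.0, 2.0, 5)

h = -x                                        # forward pass through the three layers
for layer in layers:
    h = layer.forward(h)
assert np.allclose(h, 1.0 / (1.0 + np.exp(-x)))

grad = np.ones_like(x)                        # backward pass, in reverse layer order
for layer in reversed(layers):
    grad = layer.backward(grad)
grad = -grad                                  # account for the leading negation of x
assert np.allclose(grad, h * (1.0 - h))       # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))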

Bigger operations are more efficient

As you can see, implementing the sigmoid layer this way means performing three layer operations instead of one.

SigmoidLayer(x) = EWiseDivisionLayer(1.0, AddScalarLayer(ExpLayer(-x), 1.0))

This adds computation and memory overhead compared with a single operation (overhead that could, in principle, be optimized away).

So libraries such as cxxnet and Caffe take a different approach. To support coarse-grained operations such as BatchNormalization and SigmoidLayer directly, the computation kernels are hand-crafted inside each layer, launching only one or a few CUDA kernels. This makes the implementation more efficient.
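For contrast with the three-layer composition above, a coarse-grained library would express sigmoid as one layer whose forward and backward passes each touch the data once (on a GPU, a single kernel launch). A minimal sketch:

import numpy as np

class SigmoidLayer(object):
    """One coarse-grained op: a single pass forward and a single pass backward."""
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))   # one fused computation, no extra layers
        return self.y
    def backward(self, grad_y):
        return grad_y * self.y * (1.0 - self.y)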

Compiling and optimizing

Can small operations be optimized? Of course. This is the system optimization part of the compilation engine. There are two kinds of optimization that can be applied to the computation graph.

    • Memory allocation optimization, to reuse the memory of intermediate results.
    • Operator fusion: detect patterns such as the sigmoid sub-graph and fuse them into a bigger computation kernel.

Memory allocation optimization is actually not limited to small operations; it can be applied to graphs of bigger operations as well.

However, these optimizations are not crucial for big-operation libraries such as cxxnet and Caffe, because you never notice a compilation step inside them. In fact, these libraries do contain a (simple) compilation step that translates the layers into a fixed forward and backward execution plan, executed one layer at a time.

For computation graphs built from small operations, these optimizations are critical. Because each operation is small, there are many sub-graph patterns that can be matched. And because the generated operations cannot all be enumerated in advance, the kernels must be explicitly recompiled, in contrast to the fixed set of pre-compiled kernels in big-operation libraries. This is the overhead a symbolic library pays to support small operations, and the need for compile-time optimization also increases the engineering cost of libraries that support only small operations.
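As a toy illustration of operator fusion, the sketch below treats the graph as a simple chain of named small ops, scans for the sigmoid pattern, and replaces it with one fused kernel; real compilation engines work on general graphs and generate the fused kernel code rather than looking it up in a table. All names here are made up for the example.

import numpy as np

# A toy "compiler pass": the graph is a linear chain of small ops, and the pass
# looks for the neg -> exp -> add-one -> reciprocal pattern and fuses it.
KERNELS = {
    'neg':        lambda x: -x,
    'exp':        np.exp,
    'add_one':    lambda x: x + 1.0,
    'reciprocal': lambda x: 1.0 / x,
    'sigmoid':    lambda x: 1.0 / (1.0 + np.exp(-x)),   # the fused kernel
}

def fuse(ops, pattern=('neg', 'exp', 'add_one', 'reciprocal'), fused='sigmoid'):
    out, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + len(pattern)]) == pattern:
            out.append(fused)                 # replace the matched sub-graph
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out

graph = ['neg', 'exp', 'add_one', 'reciprocal']   # sigmoid written with small ops
compiled = fuse(graph)                            # -> ['sigmoid']: one kernel launch

x = np.linspace(-3, 3, 7)
y = x
for op in compiled:
    y = KERNELS[op](y)
assert np.allclose(y, 1.0 / (1.0 + np.exp(-x)))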

As in the symbolic vs. imperative case, the big-operation libraries "cheat" by asking the user to provide restrictions (in the form of the common layers they offer), so the user is the one who actually performs the sub-graph matching. This offloads the compilation overhead to the human brain, which is usually not too bad.

Expression templates and statically typed languages

We often need to write a few small operations and compose them together. Libraries such as Caffe use hand-crafted kernels to build up these bigger blocks; otherwise the user has to do the composition on the Python side.

There is actually a third choice, and it works pretty well. It is called the expression template. The basic idea is to use template programming to generate a generic kernel from an expression tree at compile time. You can refer to an expression template tutorial for more details. cxxnet is a library that makes extensive use of expression templates, which makes the code shorter and more readable, with performance that matches hand-crafted kernels.

The difference between expression templates and Python kernel generation is that expression templates are evaluated at C++ compile time with existing types, so there is no additional runtime overhead. In principle, other statically typed languages that support templates have this property, but so far we have only seen this trick in C++.

Expression template libraries open up a middle ground between Python operations and hand-crafted big kernels, allowing C++ users to compose small operations into an efficient big operation. It is an optimization option worth considering.

Mix a variety of styles

We have compared the programming models; the next question is how to choose among them. Before discussing that, we should emphasize that, depending on the problem you are trying to solve, the comparison made here may not have a big impact.

Remember Amdahl's law: if you spend your effort optimizing a part that is not performance-critical, the overall performance will not improve much. (Quantitatively, if a fraction p of the running time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s), which can never exceed 1 / (1 - p).)

We find that there is usually a trade-off between efficiency, flexibility, and engineering complexity, and that different programming patterns suit different parts of the problem. For example, imperative programs are more natural for parameter updates, and symbolic programs for gradient calculation.

This article advocates mixing the styles. Recall Amdahl's law: sometimes the parts we want to be flexible are not performance-critical, so it is fine to keep them simple in order to support a more flexible interface. In machine learning, an ensemble of several methods usually works better than a single one.

If the programming models can be mixed in the right way, we can get better results than with any single model. Here we list some possible combinations and discuss them.

Symbolic and imperative programs

There are two ways to mix symbolic and imperative programs.

    • Use imperative programs as part of symbolic programs.
    • Use symbolic programs as part of imperative programs.

We have observed that it is often easier to write parameter updates in an imperative way, while gradient calculations are more efficient using symbolic programs.

Mixed programs can already be found in existing symbolic libraries, because Python itself is imperative. For example, the following code mixes the symbolic program with NumPy, which is imperative.

A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# compiles the function
f = compile(D)
d = f(A=np.ones(10), B=np.ones(10) * 2)
d = d + 1.0

The idea is that the symbolic graph is compiled into a function that can be executed imperatively, whose internals are a black box to the user. This is just like writing a C++ program and embedding it in Python, which we do all the time.

However, using NumPy as the imperative component is not ideal, because the parameter memory lives on the GPU. A better approach is to have a GPU-aware imperative library interact with the compiled symbolic functions, or to add a small amount of update syntax to the symbolic programs to support the parameter update.
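A sketch of what the mixed style looks like in a training loop: a fixed, "compiled" gradient function (here just a hand-written closure standing in for the output of compile([gA, gB]) above) combined with an imperative update loop; all names and the least-squares example are made up for illustration.

import numpy as np

np.random.seed(0)
x = np.random.randn(100, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = x.dot(true_w)

# Stand-in for a compiled symbolic gradient function f = compile([gA, gB]):
# here it is simply a closure computing the gradient of 0.5 * ||x w - y||^2 / n.
def make_grad_fn(x, y):
    def grad_fn(w):
        return x.T.dot(x.dot(w) - y) / len(y)
    return grad_fn

grad_fn = make_grad_fn(x, y)       # "symbolic" part: a fixed, efficient computation
w = np.zeros(3)
for step in range(200):            # "imperative" part: flexible update logic
    g = grad_fn(w)
    w -= 0.1 * g                   # in-place SGD update, easy to customize
assert np.allclose(w, true_w, atol=1e-2)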

Small operations and big operations

There are also good reasons to combine small and big operations. Consider applications such as changing a loss function or adding a few customized layers to an existing structure: the common practice is to compose the existing components with big operations and build the new parts with small operations.

Recall Amdahl's law: these new components are often not the computational bottleneck. Since the performance-critical parts are already optimized by the bigger operations, it is acceptable not to optimize the additional small operations at all, or to perform only a few memory optimizations rather than full operation fusion.

Choose your own style

We have compared several styles of deep learning programming. The purpose of this article is to list these choices and compare their trade-offs. There is no one-size-fits-all answer, and nothing prevents you from keeping your own style, or from combining the flavors you like to create more interesting and intelligent deep learning libraries.

Contribute to this note

This note is part of our open-source system design notes for deep learning libraries. You are very welcome to submit a pull request and contribute to this note.

Original link: Programming Models for Deep Learning (Translator: Zhao; Reviewers: Liu Diwei, Zhu Zhengju, Li Zijian)

Thanks to Li Mu (Weibo: @ li Mu m) for the final confirmation of this translation.

Extended Reading

@antinucleon's Chinese blog post analyzing MXNet's technical features

Li Mu's introduction to MXNet on Zhihu:

MXNet is the next generation of cxxnet. It already implements all of cxxnet's functionality, but draws on Minerva/Torch7/Theano and adds more new features.

    1. NDArray programming interface, similar to MATLAB / numpy.ndarray / torch.tensor. Its unique advantage is that the engine behind it can optimize performance and memory usage.
    2. Symbolic interface. This allows rapid construction of a neural network, with automatic differentiation.
    3. More bindings. Python is currently supported, with Julia and R coming soon.
    4. More convenient multi-GPU and multi-machine operation.
    5. Better performance. Currently MXNet is 40% faster than cxxnet, and GPU memory usage is cut in half.

MXNet is still under rapid development. This month the main directions are three: more bindings, better documentation, and more applications (language models, speech, machine translation, video). The code is at dmlc/mxnet on GitHub; you are welcome to join.

(Editor: Zhou Jianding)

This article is a CSDN translation and may not be reproduced without permission. For reprint requests, please contact market#csdn.net (replace # with @).

