Caffe Code Reading

Reproduced from:

Caffe code reading -- hierarchy, Painless Machine Learning, Zhihu column: https://zhuanlan.zhihu.com/p/21796890

Caffe source reading -- Net assembly, Painless Machine Learning, Zhihu column: https://zhuanlan.zhihu.com/p/21875025

Caffe code reading -- Solver, Painless Machine Learning, Zhihu column: https://zhuanlan.zhihu.com/p/21800004

1. Caffe code reading -- hierarchy

Author: Feng
Link: https://zhuanlan.zhihu.com/p/21796890
Source: Zhihu
Copyright belongs to the author. For commercial reproduction please contact the author for authorization; for non-commercial reproduction please credit the source.

Caffe is an excellent open-source deep neural network framework; below we walk through its source code and implementation. Caffe's code is generally readable and its architecture is clear, so reading it is not a very difficult task. But before you start, there are two questions to answer: what are you reading the code for, and to what depth do you want to read it? (The second question is closely tied to the first.)

Reading code generally serves one of the following goals:

Figuring out the algorithm or functionality the code implements. You do not yet understand the algorithm itself and hope that reading the code will explain it.

Figuring out the implementation details of an algorithm you already broadly understand. Here you read the code to learn the practical considerations behind the implementation; if you intend to use the code, knowing these details helps.

Extending the code. Starting from the open-source code, you use the existing framework and add or modify functionality to achieve what you want. This requires a deeper understanding of the code's architecture.

Our goal is to extend the code. The main extension points in Caffe are Layer and Solver; other parts can be extended too, but they require touching more code.

With the first question settled, here is the second: to what depth should we read? In general, I think the payoff of reading code can be described by a logistic curve:

On this curve, the horizontal axis is the time spent reading the code, and the vertical axis is the payoff. For a project of any size, the first reading is bound to be confusing, and it takes a while to sort out the files and the relationships between modules. As the structure becomes clearer, the reader starts to understand what the code expresses, and the payoff climbs steeply. But once we understand the main line and the principal branches, reading every minor side path yields little additional profit. So, based on this cost-benefit trade-off and the characteristics of Caffe's code, we will read only the main line and a few important branches, which amounts to roughly half of the total code.

The main structural abstractions of Caffe

Unlike some other frameworks, Caffe is not written in the symbolic-computation style; its overall architecture is built on system-level abstractions. Abstraction here means encapsulating the details of one sub-problem so that the code above it becomes clearer. Let us walk up Caffe's abstraction hierarchy to see its main structure:

SyncedMem: the main job of this class is to encapsulate data movement between CPU and GPU. In general, data flows hard disk -> CPU memory -> GPU memory -> CPU memory -> (hard disk), so without it you would constantly be writing CPU/GPU transfer code and maintaining separate memory pointers for the two devices. None of this is hard, but all of it is tedious. SyncedMem hides these transfers behind a simple interface that hands either device a synchronized view of the data.
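As a minimal sketch of this usage pattern (the class is SyncedMemory in include/caffe/syncedmem.hpp; the wrapper function here is only for illustration):

#include "caffe/syncedmem.hpp"

// Sketch of the SyncedMem usage pattern. cpu_data()/gpu_data() return
// read-only pointers and copy only when the up-to-date copy lives on the
// other device; the mutable_* variants mark the returned side as current.
void syncedmem_sketch() {
  caffe::SyncedMemory mem(64 * sizeof(float));   // room for 64 floats
  float* cpu = static_cast<float*>(mem.mutable_cpu_data());
  for (int i = 0; i < 64; ++i) cpu[i] = 1.0f;    // write on the CPU side
#ifndef CPU_ONLY
  // The first gpu_data() after a CPU write triggers the host->device copy;
  // later calls reuse the cached device copy until the CPU writes again.
  const float* gpu = static_cast<const float*>(mem.gpu_data());
  (void)gpu;
#endif
}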

Blob: this class provides two layers of encapsulation. The first is over the operational data: through Blob we can manipulate high-dimensional arrays, access their elements quickly, reshape their dimensions, and so on. The second is over the pairing of values and updates: every Blob holds two buffers, data and diff, where data stores the values themselves and diff stores the gradients propagated back to them. Blob is built on SyncedMem, so either device can access it conveniently. Blob essentially encapsulates Caffe's whole data story; inside the Net class you can see that all forward and backward activations and all parameters are represented as Blobs.
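A small sketch of the interface just described (accessors per include/caffe/blob.hpp; the shapes and the wrapper function are invented for illustration):

#include "caffe/blob.hpp"

// Sketch of the Blob interface: data() holds the values, diff() holds the
// gradients, and Update() applies data -= diff.
void blob_sketch() {
  caffe::Blob<float> blob(64, 20, 24, 24);   // N x C x H x W
  float* data = blob.mutable_cpu_data();     // gpu_data() etc. for the GPU
  data[blob.offset(0, 0, 0, 0)] = 1.0f;
  float* diff = blob.mutable_cpu_diff();
  diff[0] = 0.1f;
  blob.Update();                             // data[0] is now 0.9f
  blob.Reshape(64, 500, 1, 1);               // change dimensions in place
}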

With data taken care of, the next abstraction is the layer. As we analyzed before, a neural network can be decomposed into layers that are completely independent of one another; as long as each layer implements a fixed interface contract, the correctness of the whole network is preserved.

Layer: Caffe implements a base class Layer, and some special families also have their own abstract classes (such as BaseConvolutionLayer). The design mainly follows the template method pattern: the invariant parts of the logic are written in the base class, while the specifics are implemented in subclasses. For example, a layer's SetUp consists of several steps, some performed by the base class and some by the subclass. The same holds for the all-important Forward and Backward: the base class implements the surrounding logic, but the actual computation is delegated to subclasses. So when we need to implement a new layer, we do not have to manage the trivia; we only supply the layer's initialization and its forward and backward computations.
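As a hedged sketch of what a new layer must supply, here is a hypothetical pass-through layer; the virtual hooks follow include/caffe/layer.hpp, but the layer itself is invented for illustration:

#include <vector>
#include "caffe/layer.hpp"
#include "caffe/util/math_functions.hpp"

// Hypothetical minimal layer: the base class drives SetUp/Forward/Backward;
// the subclass only fills in the hooks below.
template <typename Dtype>
class MyIdentityLayer : public caffe::Layer<Dtype> {
 public:
  explicit MyIdentityLayer(const caffe::LayerParameter& param)
      : caffe::Layer<Dtype>(param) {}
  virtual void LayerSetUp(const std::vector<caffe::Blob<Dtype>*>& bottom,
                          const std::vector<caffe::Blob<Dtype>*>& top) {}
  virtual void Reshape(const std::vector<caffe::Blob<Dtype>*>& bottom,
                       const std::vector<caffe::Blob<Dtype>*>& top) {
    top[0]->ReshapeLike(*bottom[0]);   // top shape derived from bottom shape
  }
  virtual inline const char* type() const { return "MyIdentity"; }

 protected:
  virtual void Forward_cpu(const std::vector<caffe::Blob<Dtype>*>& bottom,
                           const std::vector<caffe::Blob<Dtype>*>& top) {
    caffe::caffe_copy(bottom[0]->count(), bottom[0]->cpu_data(),
                      top[0]->mutable_cpu_data());
  }
  virtual void Backward_cpu(const std::vector<caffe::Blob<Dtype>*>& top,
                            const std::vector<bool>& propagate_down,
                            const std::vector<caffe::Blob<Dtype>*>& bottom) {
    if (propagate_down[0]) {
      caffe::caffe_copy(top[0]->count(), top[0]->cpu_diff(),
                        bottom[0]->mutable_cpu_diff());
    }
  }
};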

Net: Net combines the data and the layers into a further encapsulation, exposing initialization plus Forward and Backward interfaces, so from the outside it looks and behaves much like a single layer, while internally its composition can vary arbitrarily. It is also worth mentioning that the input and output blobs of every layer are stored centrally in the Net, and so are the pointers to every layer's parameters; through weight sharing, different layers can point at the same parameter blobs. This lets us configure multiple layers of a network to share parameters, and broadens what network structures we can imagine.
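A short sketch of driving a Net directly (method names per include/caffe/net.hpp; the prototxt path is a placeholder):

#include <boost/shared_ptr.hpp>
#include "caffe/net.hpp"

// Sketch of using a Net: one Forward/Backward pair plus access to a named
// blob. "train.prototxt" is a placeholder path.
void net_sketch() {
  caffe::Net<float> net("train.prototxt", caffe::TRAIN);
  float loss = 0;
  net.Forward(&loss);    // fills every top blob in the network
  net.Backward();        // fills every diff
  // All intermediate blobs and all parameters live inside the Net, which is
  // what makes cross-layer weight sharing possible.
  boost::shared_ptr<caffe::Blob<float> > out = net.blob_by_name("loss");
  (void)out;
}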

Solver: with Net we can already run the network forward and backward, but learning and training are still missing, so on top of Net the Solver class encapsulates everything related to training (and prediction). It opens two kinds of interfaces. One is the parameter-update interface: subclasses of Solver implement different update rules, such as everyone's favorites Momentum, Nesterov, and AdaGrad, which is how different optimization algorithms are plugged in. The other is a set of callbacks injected at specific points of each training iteration; in the code, the direct user of these callback points is the multi-GPU training algorithm.
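A minimal sketch of the callback side of that interface, assuming the nested Solver::Callback class declared in include/caffe/solver.hpp (the subclass here is hypothetical):

#include "caffe/solver.hpp"

// Hypothetical Solver::Callback subclass. New update rules take the other
// route: subclass SGDSolver and override ComputeUpdateValue, which is how
// Caffe's own Nesterov and AdaGrad solvers are written.
class LoggingCallback : public caffe::Solver<float>::Callback {
 protected:
  virtual void on_start() { /* e.g. broadcast parameters to workers */ }
  virtual void on_gradient_ready() { /* e.g. reduce gradients to the root */ }
};
// Registered on a solver with solver->add_callback(&cb) before Solve().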

IO: is everything above enough? Not yet; we still need to feed in data and parameters. As the saying goes, without data everything else is in vain. DataReader and DataTransformer help prepare the input data, Filler initializes the parameters, and the snapshot machinery persists the model. With that, the IO of models and data is solved.

Multi-GPU: for single-GPU training, the hierarchy basically ends here. For multi-GPU training, the top of the stack adds two classes, InternalThread and P2PSync; they are top-level classes and call only the Solver and some parameter classes.

At this point, Caffe's main line has essentially been covered. We can draw a picture showing Caffe's overall hierarchy:

If you understand this picture and the details behind it, your grasp of Caffe is already good, and you need not read the source-analysis articles that follow. If not, keep following along. Of course, to really understand the meaning of this picture, or to really read the code, you still need some of the details, and some of them are not analyzed here. Next time we will stand at the layer's angle and watch a layer's full experience during training.
2. Caffe source code reading -- Net assembly

I have been busy watching TI lately and did not write in time, so today I am hurrying to make up an article... Net is one of the core classes in the Caffe code. Looking downward, it encapsulates all the layers and builds them into a complete neural network; looking upward, it provides forward/backward computation and access to the core data structures, so the Solver can conveniently use Net to implement its train and test strategies. Precisely because of this importance, assembling a Net is one of the more complex parts of the code. This time we take a look at what Net does.

Up front, reading the Net assembly code serves two purposes:

understanding the problems a mature CNN framework has to consider as it builds a model;

understanding how data flows between Layer and Net, which matters if you want to extend the network structure, for example by writing a new layer.

First, to keep things simple, we will look at the key steps of Net assembly through the log output printed while training a model, and only then expand into the full details of the assembly.

Net assembly through the eyes of the log

To show the details of Net assembly better, we pick a practical example: the Siamese model from Caffe's examples. We will not say much about the model itself; the interested reader can consult the official or unofficial documents. Only one point matters here: besides the features of an ordinary network, this network also reuses (shares) parameters, which we will exploit in the analysis below.

What we want to look at next is the net-assembly log. It is normally a large wall of text that flashes by when you start training; if it stops rather than flashing by, your network definition probably has a problem. The log is long; overall it covers the assembly of two networks, one for the train phase and one for the test phase. We will focus on a few fragments that reveal the core elements of Net assembly, the parts most worth printing.

First, a normal convolution layer, conv1. The log is as follows (the source line numbers in it may differ between versions, but the positions are similar):

layer_factory.hpp:77] Creating layer conv1
net.cpp:92] Creating Layer conv1
net.cpp:428] conv1 <- data
net.cpp:402] conv1 -> conv1
net.cpp:144] Setting up conv1
net.cpp:151] Top shape: 64 20 24 24 (737280)
net.cpp:159] Memory required for data: 3752192

The first line is produced by the code that creates this layer instance; the creation itself happens inside layer_factory. To make layer creation convenient, Caffe uses the factory method design pattern: given a layer name (the field is called type in the configuration file) and the corresponding parameters, a layer can be instantiated by name. The details become clear with a careful read.
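A sketch of that factory path, assuming the LayerRegistry interface from include/caffe/layer_factory.hpp:

#include <boost/shared_ptr.hpp>
#include "caffe/layer.hpp"
#include "caffe/layer_factory.hpp"
#include "caffe/proto/caffe.pb.h"

// Sketch of what the first log line corresponds to: the "type" string from
// the prototxt is looked up in the registry and turned into a live layer.
void factory_sketch() {
  caffe::LayerParameter param;
  param.set_name("conv1");
  param.set_type("Convolution");
  boost::shared_ptr<caffe::Layer<float> > layer =
      caffe::LayerRegistry<float>::CreateLayer(param);
  // A new layer class adds itself to the registry with the
  // REGISTER_LAYER_CLASS(MyIdentity) macro in its .cpp file.
  (void)layer;
}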

Lines 3 and 4 show the creation of the current layer's bottom and top blobs. This involves the AppendBottom and AppendTop methods in net.cpp; since every bottom blob and top blob has a name, the connections between layers are derived from those names.

Line 5 does not look like much, but it marks that the layer's SetUp function has completed (or that the layer was shared). SetUp is the key function of layer initialization, and it performs the following concrete actions:

CheckBlobCounts(bottom, top);
LayerSetUp(bottom, top);
Reshape(bottom, top);
SetLossWeights(top);

In summary, the four calls accomplish the following:

a check that the numbers of bottom blobs and top blobs are what the layer expects, implemented by the base class;

the initialization of the layer's internal variables, implemented by the concrete subclass;

the determination of the top blob's dimensions from the incoming bottom blob's dimensions, according to the computation the layer intends to perform, implemented by the concrete subclass. For example, if this layer is a convolution with 20 kernels of size 5x5 and the input image (the bottom blob) is 1x28x28, then the output is 20x24x24, which, with the batch size added, is exactly the shape in the log above (see the dimension check after this list);

the initialization of whether the layer outputs loss and what to do with that loss, implemented by the base class. It must be said that the way the loss layers juggle loss_weight, loss_ and top.cpu_diff is somewhat roundabout and somewhat of a trick.
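To check the arithmetic in the convolution example above, a generic output-size formula (not code taken from Caffe; the helper is for illustration):

// Generic convolution output size: out = (in + 2*pad - kernel)/stride + 1
int conv_out_dim(int in, int kernel, int pad, int stride) {
  return (in + 2 * pad - kernel) / stride + 1;
}
// conv_out_dim(28, 5, 0, 1) == 24; with 20 kernels and batch size 64 the
// top blob is 64 x 20 x 24 x 24 = 737280 values, matching the log above.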

All right, back to the log. The next line tells us the dimensions the top blob will have. The dimensions are printed so that readers do not have to compute them on faith; you can check whether they match what you expect. The loop that prints this line does more than that, though: its main job is to set the top blob's loss_weight.

The last line accumulates the memory occupied by the layers' top blobs. You can see that up to this layer the consumption is around 3.6 MB; not big.

So much for the most typical layer initialization. The ReLU layer below is slightly different:

layer_factory.hpp:77] Creating layer relu1
net.cpp:92] Creating Layer relu1
net.cpp:428] relu1 <- ip1
net.cpp:389] relu1 -> ip1 (in-place)
net.cpp:144] Setting up relu1
net.cpp:151] Top shape: 64 500 (32000)
net.cpp:159] Memory required for data: 5769472

The biggest difference is the "(in-place)" at the end of line 4, which shows that the bottom blob and top blob of the ReLU are the same piece of data, exactly as in our network definition. The benefit of in-place computation is fewer memory operations, although the memory-consumption counter above does not account for the savings that in-place brings.

Next comes conv1_p, the layer whose parameters are shared:

layer_factory.hpp:77] Creating layer conv1_p
net.cpp:92] Creating Layer conv1_p
net.cpp:428] conv1_p <- data_p
net.cpp:402] conv1_p -> conv1_p
net.cpp:144] Setting up conv1_p
net.cpp:151] Top shape: 64 20 24 24 (737280)
net.cpp:159] Memory required for data: 8721664
net.cpp:488] Sharing parameters 'conv1_w' owned by layer 'conv1', param index 0
net.cpp:488] Sharing parameters 'conv1_b' owned by layer 'conv1', param index 1

The most distinctive part of this fragment is the two "Sharing parameters" lines at the end. Because the Siamese model consists of two networks with exactly the same parameters, the build of the second network detects that the parameter names already exist, meaning this layer's parameters are shared with another layer, and it prints that fact for the user. The lines not shown earlier also tell us that the Net class is responsible for parameter-related initialization in general; there is quite a lot of it besides sharing, including the per-parameter learning rate and weight decay settings.
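The sharing itself is declared in the network prototxt by giving the parameter blobs explicit names; a trimmed sketch in the spirit of the Siamese example definition (fields abbreviated):

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param { name: "conv1_w" }   # naming the weights makes them shareable
  param { name: "conv1_b" }
  convolution_param { num_output: 20 kernel_size: 5 }
}
layer {
  name: "conv1_p"
  type: "Convolution"
  bottom: "data_p"
  top: "conv1_p"
  param { name: "conv1_w" }   # same names: Net reuses conv1's blobs
  param { name: "conv1_b" }
  convolution_param { num_output: 20 kernel_size: 5 }
}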

Last comes the most special layer, the loss layer:

net.cpp:92] Creating Layer loss
net.cpp:428] loss <- feat
net.cpp:428] loss <- feat_p
net.cpp:428] loss <- sim
net.cpp:402] loss -> loss
net.cpp:144] Setting up loss
net.cpp:151] Top shape: (1)
net.cpp:154]     with loss weight 1
net.cpp:159] Memory required for data: 10742020

This layer does not look special; most of it is the same as before. The one difference is the penultimate line, which shows that this layer has a loss weight. What loss weight is for we will explain in detail later; briefly, a blob that has a loss weight will be used to compute the loss.

The log above mainly covers network assembly and some forward bookkeeping. From it we can see that Net completes the following things:

instantiate the layers;

create the bottom and top blobs;

set up each layer (initialize it and determine its top blob dimensions);

determine each layer's loss_weight;

determine whether a layer's parameters are shared, and create new parameter blobs when they are not.

As the process above shows, all the flowing variables of the network (the bottom and top blobs) are kept inside the Net, and every layer's parameters are marked according to the sharing relationships among layers. The advantage is centralized management of all the data in the network, which makes operating on the data convenient.

Further down, we can excerpt another small piece of the log:

net.cpp:220] pool1 needs backward computation.
net.cpp:220] conv1 needs backward computation.
net.cpp:222] slice_pair does not need backward computation.
net.cpp:222] pair_data does not need backward computation.
net.cpp:264] This network produces output loss
net.cpp:277] Network initialization done.

This step determines, layer by layer, whether backward computation is needed. In general our layers need it, but there are layers that do not, such as the data layers (as in the log above) and layers deliberately kept fixed, which is common when finetuning a network. Since backward computation is generally slower than forward computation, skipping the layers that do not need it saves time.

Finally, the log reports which blobs the whole network produces as output, and the training iterations can begin.

With this overview we have a general understanding of how a Net is loaded, and looking at its code will now be much easier.

Finally, regarding the member variables of the Net class and the relationships among them, the following diagram is a good aid to understanding:

Once Net initialization is understood, there is not much left to puzzle over in the architecture below it. Next we look at the things above Net: the Solver, and Caffe's "simple" multi-GPU training.

3. Caffe code reading -- Solver

Last time we talked about Net assembly; this time we look at the Solver. The Solver body has two parts: initialization and training. The initialization is relatively simple, so we will not cover it here; instead let us talk about some key functions used in training.

Core function: Step

The real training happens in the Step function. Here sit the key callbacks for multi-GPU training, on_start() and on_gradient_ready(); how exactly they are invoked we will discuss later. Between the two callbacks lie two important steps: ForwardBackward and UpdateSmoothedLoss. After on_gradient_ready comes the key function ApplyUpdate(), whose code lives in SGDSolver. We will look at all of these in detail below.
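As a paraphrased outline (not a verbatim copy of solver.cpp; iter_, callbacks_, net_ and param_ are Solver members):

// Paraphrased outline of Solver<Dtype>::Step.
while (iter_ < stop_iter) {
  for (int i = 0; i < callbacks_.size(); ++i)
    callbacks_[i]->on_start();                // multi-GPU: sync parameters
  Dtype loss = 0;
  for (int i = 0; i < param_.iter_size(); ++i)
    loss += net_->ForwardBackward();          // accumulate over iter_size
  loss /= param_.iter_size();
  UpdateSmoothedLoss(loss, start_iter, average_loss);
  for (int i = 0; i < callbacks_.size(); ++i)
    callbacks_[i]->on_gradient_ready();       // multi-GPU: reduce gradients
  ApplyUpdate();                              // implemented in SGDSolver
  ++iter_;
}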

ForwardBackward

This step mainly calls into the Net code and completes the forward and backward computation: the forward pass computes the model's final output and the loss, and the backward pass computes the gradients of every layer's activations and parameters. Forward and backward have been described in detail before; the one thing worth mentioning here is how the loss is computed in the forward pass. That actually happens inside the layers, specifically through the initialization of the loss_weight parameter and the loss() accessor, as well as the loss layer's initialization in its SetUp function.

UpdateSmoothedLoss

This function smooths the loss. Because Caffe trains with SGD, we cannot feed all the data into the model at once, so the loss produced by a given batch may differ from the average loss over the full sample; averaging the current loss with the losses from recent updates reduces the apparent oscillation. The smoothing method in the code is simple enough to read at a glance.
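In spirit, the smoothing is a sliding-window average over the last average_loss iterations; a paraphrase (losses_, smoothed_loss_ and iter_ are Solver members):

// Paraphrase of Solver::UpdateSmoothedLoss: an O(1) sliding-window average.
void UpdateSmoothedLoss(float loss, int start_iter, int average_loss) {
  if ((int)losses_.size() < average_loss) {
    losses_.push_back(loss);                        // window still filling
    int size = losses_.size();
    smoothed_loss_ = (smoothed_loss_ * (size - 1) + loss) / size;
  } else {
    int idx = (iter_ - start_iter) % average_loss;  // overwrite the oldest
    smoothed_loss_ += (loss - losses_[idx]) / average_loss;
    losses_[idx] = loss;
  }
}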

Next is the ApplyUpdate function, which actually carries out the parameter update. Caffe's parameter update uses only the model's gradient information and does not use second-order information. The following sections describe the several stages of an update in detail:

GetLearningRate
ClipGradients
Normalize
Regularize
ComputeUpdateValue

GetLearningRate

We talked about the story of learning rates earlier; they really are a big issue in CNN training. To make learning-rate design more flexible, Caffe offers a series of schedules (sketched in code after this list):

fixed: the lr never changes.
step: the lr drops by a constant factor every stepsize iterations.
exp: the lr decays exponentially with the iteration number.
inv: the lr decays polynomially in (1 + gamma * iter).
multistep: you write down directly at which iterations the lr should change.
poly: the lr follows a polynomial decay that reaches zero at max_iter.
sigmoid: the lr decays along a sigmoid centered at stepsize.
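As a sketch, the formulas behind these schedules, paraphrasing SGDSolver::GetLearningRate (base_lr, gamma, power and stepsize are the solver prototxt parameters of those names; the wrapper function is for illustration):

#include <cmath>
#include <string>

// Paraphrase of the learning-rate schedules in SGDSolver::GetLearningRate.
float lr_sketch(const std::string& policy, float base_lr, float gamma,
                float power, int stepsize, int iter, int max_iter) {
  if (policy == "fixed")   return base_lr;
  if (policy == "step")    return base_lr * std::pow(gamma, iter / stepsize);
  if (policy == "exp")     return base_lr * std::pow(gamma, iter);
  if (policy == "inv")     return base_lr * std::pow(1.0f + gamma * iter, -power);
  if (policy == "poly")    return base_lr * std::pow(1.0f - float(iter) / max_iter, power);
  if (policy == "sigmoid") return base_lr / (1.0f + std::exp(-gamma * (iter - stepsize)));
  // "multistep" behaves like "step" but switches at the explicit stepvalue
  // milestones instead of every stepsize iterations.
  return base_lr;
}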

These schedules each have their pros and cons; choose the one that suits your problem.

ClipGradients

This step bounds the gradient values. If the gradient is too large, it is trimmed: all parameters are multiplied by one scaling factor so that the L2 norm over all parameter gradients does not exceed the configured total. This function feels like a trust region applied globally; it prevents an overly large update from making the gradients diverge. I think the idea of this step is good, but it may cause problems in practice: in reality only some parameters have large gradients while the gradients of the others are themselves small, so multiplying all parameters by the same factor makes the already-small ones even smaller, which brings a certain unfairness.
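A paraphrase of the logic (sumsq_diff() and scale_diff() are real Blob methods; the wrapper function is for illustration):

#include <cmath>
#include <vector>
#include "caffe/blob.hpp"

// Paraphrase of SGDSolver::ClipGradients: if the global L2 norm of all
// parameter gradients exceeds the threshold, every gradient is scaled by
// the same factor so the norm drops to the threshold.
void clip_gradients_sketch(const std::vector<caffe::Blob<float>*>& params,
                           float clip_gradients) {
  if (clip_gradients < 0) return;                  // negative means disabled
  float sumsq_diff = 0;
  for (size_t i = 0; i < params.size(); ++i)
    sumsq_diff += params[i]->sumsq_diff();
  const float l2norm_diff = std::sqrt(sumsq_diff);
  if (l2norm_diff > clip_gradients) {
    const float scale_factor = clip_gradients / l2norm_diff;
    for (size_t i = 0; i < params.size(); ++i)
      params[i]->scale_diff(scale_factor);         // same factor for everyone
  }
}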

Normalize

This step handles the case where a single batch is not enough for one update (gradients accumulated over iter_size sub-batches): it rescales the accumulated gradient so that the total update stays controlled. The code is simple.

Regularize

At this point we finally compute the gradient of the regularization term. Caffe provides two regularizers, L2 and L1; the standard gradient is used for L2, and the sub-gradient for L1. The L2 computation is simple and there is nothing to say about it, but the L1 case has something worth pondering: the sub-gradient method is not wrong as such, but lasso-style optimization admits other methods, a question worth a further detailed chat.
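A paraphrase of the CPU branch for one parameter blob (caffe_axpy and caffe_cpu_sign are Caffe's BLAS-style helpers; the wrapper function and the scratch blob are for illustration):

#include <string>
#include "caffe/blob.hpp"
#include "caffe/util/math_functions.hpp"

// Paraphrase of SGDSolver::Regularize on the CPU. local_decay is
// weight_decay times the parameter's decay_mult; temp is a scratch blob of
// the same shape as the parameter.
void regularize_sketch(caffe::Blob<float>* param, caffe::Blob<float>* temp,
                       const std::string& regularization_type,
                       float local_decay) {
  if (regularization_type == "L2") {
    // diff += local_decay * data   (gradient of (local_decay/2)*||w||^2)
    caffe::caffe_axpy(param->count(), local_decay,
                      param->cpu_data(), param->mutable_cpu_diff());
  } else if (regularization_type == "L1") {
    // diff += local_decay * sign(data)   (sub-gradient of local_decay*||w||_1)
    caffe::caffe_cpu_sign(param->count(), param->cpu_data(),
                          temp->mutable_cpu_data());
    caffe::caffe_axpy(param->count(), local_decay,
                      temp->cpu_data(), param->mutable_cpu_diff());
  }
}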

ComputeUpdateValue

Here we arrive at the last stop of the gradient computation. The gradient itself is finally complete, and what remains is to combine the learning rate and the gradient into the final update value. The SGD method is based on momentum plus the gradient; we have already discussed the advantages of momentum. Beyond plain SGD, Caffe provides a whole family of update rules, each with its own character, which we can examine slowly later.
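A paraphrase of the plain-SGD case for one parameter (caffe_cpu_axpby computes Y = alpha*X + beta*Y; the wrapper function is for illustration):

#include "caffe/blob.hpp"
#include "caffe/util/math_functions.hpp"

// Paraphrase of SGDSolver::ComputeUpdateValue on the CPU for one parameter:
// history = momentum * history + local_rate * diff, then diff <- history,
// so the later Blob::Update() call (data -= diff) applies the step.
void compute_update_value_sketch(caffe::Blob<float>* param,
                                 caffe::Blob<float>* history,
                                 float local_rate, float momentum) {
  caffe::caffe_cpu_axpby(param->count(), local_rate, param->cpu_diff(),
                         momentum, history->mutable_cpu_data());
  caffe::caffe_copy(param->count(), history->cpu_data(),
                    param->mutable_cpu_diff());
}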

When this is done, Update can be called on each parameter Blob to combine its data and diff (data -= diff) into the final result. With that, the whole optimization step is complete. The rest of the content is off the core path, so we skip it.

If we adopt single-GPU training, the code reading could almost stop here. But multi-GPU training is essential for large-scale tasks, so let us strike while the iron is hot and look at how Caffe trains with multiple GPUs.

The multi-GPU training algorithm

The general idea of Caffe's multi-GPU training is data parallelism: different GPUs process different data, and then all the gradient updates are aggregated. Since the Solver exposes the two callbacks during training, multi-GPU training mainly plugs into them, and one iteration consists of:

1. on_start(): copy the parameters to every GPU.
2. ForwardBackward(): every GPU computes its own forward and backward results.
3. on_gradient_ready(): the backward gradients are aggregated together.
4. ApplyUpdate(): the parameters are updated on the aggregating (root) thread.

Step 2 is carried out by each CPU thread together with its own GPU, and step 4 by the root CPU thread and its GPU; the remaining steps 1 and 3 are data-transfer tasks, and they form the main part of the multi-GPU computation.

Caffe passes parameters along a tree structure, with one CPU thread and its GPU as the root of the tree and the others as nodes beneath it. To transfer GPU data faster, the construction of the tree takes into account how close GPUs are to one another, for example whether two GPUs can communicate peer-to-peer (P2P). In an earlier translated blog post we discussed data transfer between GPUs; the tree structure here exists mainly for that reason.
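For reference, the pre-NCCL command-line tool wires this up roughly as follows (a loose paraphrase of the train() path in tools/caffe.cpp; treat the exact signatures as approximate):

#include <vector>
#include <boost/shared_ptr.hpp>
#include "caffe/parallel.hpp"

// Loose paraphrase of multi-GPU startup: a P2PSync tree is built over the
// chosen GPU ids with the root solver at the root; Run() spawns the worker
// threads, registers the callbacks and trains to completion.
void multigpu_sketch(boost::shared_ptr<caffe::Solver<float> > root_solver,
                     const std::vector<int>& gpus) {
  caffe::P2PSync<float> sync(root_solver, NULL, root_solver->param());
  sync.Run(gpus);
}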

Suppose the topology of our four GPUs looks like this:

nvidia-smi topo -m

       GPU0   GPU1   GPU2   GPU3
GPU0    X     PHB    SOC    SOC
GPU1   PHB     X     SOC    SOC
GPU2   SOC    SOC     X     PHB
GPU3   SOC    SOC    PHB     X

Then the tree structure we construct follows this topology, and the data transfers travel along the same structure.

That settles the data transfers of steps 1 and 3. For the concrete process, please read the code in detail; we will not describe it here.

This concludes the basic introduction to the Caffe code. With a clearer picture of the overall structure in hand, we will analyze the characteristics of the individual parts of the model later.

