Original address: http://www.sohu.com/a/198477100_633698

This text is excerpted from the book "Vernacular Deep Learning and TensorFlow".

As research and experimentation on neural networks continue, many new network structures and models appear every year. Most of these models retain the characteristics of classical neural networks but introduce variations; call them hybrids or variants, in short they represent the many inventive ways in which neural network design keeps being reimagined.

These changes usually improve the performance of the networks in particular sub-fields or scenarios (although we would of course like a network to generalize well everywhere). The deep residual network (ResNet) is a representative of these many variants, and it really does work well in certain areas, such as object detection.

Application Scenarios

For traditional deep learning networks, experience tells us that the deeper a network is, the more it can learn. Of course, convergence also becomes slower and training takes longer, and beyond a certain depth training becomes noticeably less effective. The deep residual network was designed to overcome this problem, known as network degradation: as a plain network grows deeper, training slows and accuracy stops improving; in some cases, adding layers even reduces accuracy.

There is not much introductory material on deep residual networks, at least far less than for the traditional BP, CNN, and RNN networks. My main reference is a public talk by Mr. Kaiming He, "Deep Residual Networks -- Deep Learning Gets Way Deeper".

When training classifiers with traditional convolutional neural networks, some unexpected phenomena were observed as the layers were deepened. For example, on CIFAR-10, a 56-layer network of 3x3 convolution kernels performed worse than a 20-layer convolutional network, on both the training set and the validation set. The usual way to let a network learn more is to deepen it, giving the network a higher VC dimension. But the fact is that the 56-layer network did worse than the 20-layer one.

The essence of this phenomenon is a fitting problem caused by information loss. After repeated downsampling in the deep layers of the network, a strange effect appears: pictures from clearly different categories produce nearly identical stimulation patterns inside the network. This shrinking of the differences means the final classification cannot be very good, so the solution should be to preserve the differences among these stimuli and thereby recover generalization ability. That is why a larger-scale change to the structure of the traditional CNN was considered, and the result did not disappoint: the excellent behavior of the new deep residual network in image processing is genuinely eye-opening.

So far, deep residual networks have been shown to work well in image classification, object detection, and semantic segmentation. The figures above (not reproduced here) show attempts to use a deep residual network to identify specific targets in a picture; each target's label comes from Microsoft's COCO dataset. The decimal number in each bounding box (for example on a person) is the probability, that is, the confidence with which the model believes it has correctly recognized that type of object. Most of the objects in the picture are recognized very accurately.

Structural interpretation and mathematical derivation

The big problem with deep networks is that vanishing gradients and exploding gradients occur easily as the depth increases.

We have mentioned this problem before: because the network is so deep, the derivatives computed during backpropagation are multiplied together layer after layer, so a factor less than 1, or one greater than 1, becomes exponentially small or exponentially large over, say, 150 layers. We can check by hand: 0.8 to the 150th power is on the order of 10^-15, while 1.2 to the 150th power is on the order of 10^11. Either extreme is a serious disaster, and each one alone can make training futile.
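A quick sanity check of the arithmetic above, in plain Python: a repeated per-layer factor below 1 vanishes over 150 layers, while a factor above 1 explodes.

```python
# How multiplicative factors behave over 150 layers:
# a factor < 1 shrinks toward zero, a factor > 1 blows up.
LAYERS = 150

shrink = 0.8 ** LAYERS   # stands in for a vanishing gradient
grow = 1.2 ** LAYERS     # stands in for an exploding gradient

print(shrink)  # on the order of 1e-15
print(grow)    # on the order of 1e+11
```

Either way, after enough layers the useful signal in the gradient is gone, which is exactly the disaster described above.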

In a traditional plain network, each layer's input can come only from the previous layer, as shown above; data flows down level by level. For a convolutional neural network, each pass through a convolution kernel produces something like a lossy compression of the input, and one can imagine that once this lossy compression goes far enough, it is no accident that two originally distinguishable photos become indistinguishable. Strictly speaking, calling this behavior lossy compression is not quite appropriate; in engineering we call it downsampling: as the vector passes through the network's filters, its size is progressively reduced. In convolutional networks, both the convolution layer and the pooling layer commonly serve this downsampling role. The main purpose is to avoid overfitting, with the side benefit of reducing computation. In the deep residual network, this structure is changed quite visibly.

Here a kind of "short-circuit" (shortcut) design is introduced: the output of an earlier layer skips over several layers and is fed into the input of a later layer, as shown in the figure. What effect does this have? Simply put, the "clearer" vector data from the earlier layer is used, together with the more heavily "lossy-compressed" data, as the input to the subsequent layer. Compared with a plain network without this shortcut, the absence of this reference data is itself a form of information loss. Where a mapping formed by a 2-layer block was originally expected to fit some desired function F(x), we now expect it to fit H(x) = F(x) + x, which in itself introduces richer reference information, that is, richer dimensions (features), so the network can learn richer content.
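The relation H(x) = F(x) + x can be sketched in a few lines. This is a minimal illustration with made-up dense weights (W1, W2 are hypothetical, not the paper's actual convolutional architecture); the point is only that the untouched input x is added back onto the branch output F(x).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 8)) * 0.1   # hypothetical layer-1 weights
W2 = rng.standard_normal((8, 8)) * 0.1   # hypothetical layer-2 weights

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x):
    f = W2 @ relu(W1 @ x)   # F(x): the two-layer residual branch
    return f + x            # shortcut: add the untouched input back

x = rng.standard_normal(8)
h = residual_block(x)
print(h.shape)  # (8,)
```

Note that if the branch F learns to output zero, the block simply passes x through unchanged, which is exactly why adding such blocks should never make the network worse.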

This diagram compares the depth and structure of three networks: VGG-19, a 34-layer "plain network" (an ordinary 34-layer CNN), and a 34-layer deep residual network.

The design of the deep residual network follows a deliberately simple philosophy: just deepen the network. All convolution layers use 3x3 kernels, no fully connected layer is placed among the hidden layers, and no dropout mechanism is used during training. Take the 2015 ILSVRC & COCO competitions as an example: the classification-oriented deep residual network for ImageNet classification reached a record-breaking depth of 152 layers.

The introduction of this shortcut layer has an interesting consequence: it produces a very smooth forward pass. The output x_{l+1} of layer l is a purely linear superposition of that layer's input x_l and its branch:

x_{l+1} = x_l + F(x_l, W_l)

Expanding this further through the subsequent layers gives, for any deeper layer L:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

That is, any deeper vector x_L receives a direct additive contribution from the shallower x_l in front of it.

OK, now look at the backward propagation of the error, which is also a very smooth process. Start from the forward expression we just saw for the output of layer L:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

If we define the residual (that is, the loss) as E, then by the chain rule we should have:

∂E/∂x_l = (∂E/∂x_L) · (∂x_L/∂x_l) = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i))

Here x_label denotes the ideal vector value for the layer's output given the current sample, and defining the residual against it (for example E = ½‖x_label − x_L‖²) is perfectly fine. The derivation itself is the standard chain rule and comes out directly; it is very simple.

Notice what this expression says, in plain language: the residual produced at the output of any layer L can be passed directly back to any layer l in front of it. The backward path is very "fast", or "direct", so no obvious efficiency problem appears even when the number of layers grows large. And there is one more noteworthy point. Because of the factor

1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i)

the gradient ∂E/∂x_l involves a linear superposition (a sum that includes the constant 1) rather than a pure product of per-layer factors, so the gradient is unlikely to vanish. This is the mathematical derivation explaining why a deep residual network can be so deep without suffering the dreaded vanishing-gradient problem or the associated training-efficiency problem.
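The contrast between a product of small factors and a product of (1 + small factor) terms can be made concrete with a toy scalar calculation. The per-layer derivative d = 0.01 here is an assumed illustrative value, not measured from any real network.

```python
# Toy comparison of end-to-end gradient magnitude over 150 layers
# when each layer's branch has a small local derivative d.
d = 0.01
LAYERS = 150

# Plain chain: the gradient is a product of the small derivatives.
plain_grad = d ** LAYERS            # collapses to effectively zero

# Residual chain: each factor is (1 + d) because of the identity term.
residual_grad = (1 + d) ** LAYERS   # stays on the order of a few units

print(plain_grad)
print(residual_grad)
```

The "+1" contributed by the identity shortcut keeps every factor near 1, so the gradient survives 150 layers instead of vanishing.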

One more point of explanation: E and x_l here refer to the residual and the output value of two different layers. Note that in a multi-layer network, each layer can be regarded as a classifier model in its own right, even though the specific classification each layer performs is hard to give an exact, convincing physical interpretation for. Still, each layer of neurons does act as a classifier: it takes the vector from the previous layer and maps it into a new vector-space distribution. From this point of view, it is perfectly fine to apply the relation above to any network fragment "taken out of context"; nothing requires that the loss be propagated all the way from the very last layer.

Topology interpretation

Besides the function-expression explanation given earlier, there is another explanation for why the deep residual network has such strong learning capability and good performance, and this one can be visualized.

The shortcut items amount to adding short connections across a network like the one above, and these shortcut paths actually form a new topological structure.

For example, after adding short connections to just the three blocks F1, F2, F3, the network actually evolves into the topology shown on the right. We can clearly see that this is equivalent to fusing, or running in parallel, a number of different network models: the input vector is fed forward through several different classifier paths whose results are combined. The plain model, by contrast, has only the single serial structure at the bottom, and this difference between the two is the key to their different learning abilities.
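The "parallel models" view above can be enumerated explicitly: each residual block can either apply its branch or be skipped via the identity shortcut, so n blocks unfold into 2^n distinct paths through the network. A small sketch with the hypothetical blocks F1, F2, F3:

```python
from itertools import product

blocks = ["F1", "F2", "F3"]

paths = []
for choice in product([False, True], repeat=len(blocks)):
    # A path is the subsequence of blocks whose branch was taken;
    # if every block is skipped, the path is the pure identity.
    taken = [b for b, used in zip(blocks, choice) if used]
    paths.append(taken or ["identity"])

print(len(paths))  # 8 paths from 3 residual blocks
for p in paths:
    print(" -> ".join(p))
```

So three residual blocks already behave like an implicit combination of eight sub-networks of varying depth, which is one intuition for the improved learning ability.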

What is special about the residual network is precisely this unit containing the shortcut part. The author found that two different kinds of shortcut unit are provided for use in the Keras framework: one whose shortcut contains a convolution, and one whose shortcut does not.

A word about Keras: Keras is also a very good deep learning framework, or perhaps "shell" is the more accurate word. It provides a concise interface that lets users describe many models in very, very short code. Its backend supports both TensorFlow and Theano as the underlying implementation. Processes that are very complicated to describe in TensorFlow can be wrapped very cleanly in Keras, so in actual work the author also often uses Keras to "wrap" TensorFlow on projects; the code becomes far more readable. Give it a try if you are interested; the appendix of this book also provides Keras installation instructions for reference.

GitHub sample

Many implementations of the deep residual network have been uploaded to GitHub. Here are a couple of versions we have tried:

https://github.com/ry/tensorflow-resnet

https://github.com/raghakot/keras-resnet/blob/master/resnet.py

With the former, you can also download a pretrained model from the Internet; these are models that their makers have already trained on some dataset. Think of them as "semi-finished products" that already have a certain amount of discriminative ability.

In your own application scenario, you can initialize with these "semi-finished" models and continue training them as needed, making them better adapted to their assigned scene. This approach is seen in many projects; after all, getting a model to this "semi-finished" level yourself would cost a great deal of human effort and time.

Mr. Kaiming He himself has also published an implementation at https://github.com/KaimingHe/deep-residual-networks, but only for Caffe. Readers interested in studying the Caffe framework can use it as a reference; we will not go into the code in detail here.

Summary

It should be said that the invention of the residual network is another beneficial experiment in network connection structure, and its actual effect is indeed good. Someone once asked me: what if the shortcut in a deep residual network did not skip over two convolution layers, but skipped 1, or 3, or some other number; what would happen?

The question is difficult to answer, but the question itself is not meaningless.

First of all, skipping 1 or 3 layers: each different connection pattern is a new network topology with its own classification capability. Because the structure of a neural network is itself very complex, it is hard to discuss directly how the learning abilities of two different topologies compare after such a change. One thing is certain, however: this kind of "parallel" structure improves the learning ability of the network itself. As for which scenarios benefit, and by how much, that has to be tried and compared in experiments, from which new theoretical results can then be summarized. So we cannot rule out that skipping 1 or 3 layers for the short connection would classify better in some other domains; that requires a concrete process of experiment and demonstration. Every year, papers on adjusting network structure are published internationally, all based on experiments from which theory is then summarized; although most are not major breakthroughs, science always needs this process of gradual, quantitative change.

I believe that, in our daily work, paying close attention to the latest international papers and experimental results lets us, on the basis of a firm grasp of theory, boldly propose new ideas and try to prove them. This is an encouraging research attitude and a good way to gain experience.

Editor's note: this text is excerpted from the book "Vernacular Deep Learning and TensorFlow". Written for the "layman", the book introduces deep learning techniques and skills starting from zero, breaking the prerequisite knowledge of calculus, gradients and so on into digestible fragments and flattening the learning curve as much as possible, so that readers feel engaged and at ease.

"Vernacular Deep Learning and TensorFlow"