This is a note on a network-compression paper from the ICLR 2017 OpenReview submissions, "Training Compressed Fully-Connected Networks with a Density-Diversity Penalty".
The title alone tells you that the method targets the fully-connected layers, which cut my enthusiasm in half.

———————— Introduction ————————
The author uses VGG as the example to argue that the fully-connected layers occupy the great majority of the resources, so compressing them matters most. Unfortunately the convolutional layers are not addressed (T_T), even though there is plenty to cut there as well.
The article introduces two terms that I find very apt: "density" and "diversity". Between them, these two terms essentially cover most of the existing compression methods for deep models.
"Density" leads to a more representative approach is pruning, matrix decomposition, etc., that is, reduce the network sparsity (redundancy), so that the model is compressed.
"Diversity" leads to a more representative method is a quantitative method, with a small number of code words to represent a large weight matrix, that is, reduce the diversity of network parameters, so you can only store these different code words, thus compressing the model.
Accordingly, the article adds penalties on the density and diversity of the fully-connected layers to the loss, with the aim of making the network sparser and less diverse.
And this is my favorite point in the article: the author penalizes the density and diversity of the fully-connected layers in the loss not in order to obtain a small model directly, but so that pruning and quantization (see the references) work better on top of it. The sparser the network, the more branches we can prune; the lower the diversity of the parameters, the fewer codewords we need to quantize them.
PS. I have seen several articles before that also add the network's width and depth into the loss, hoping to directly train a model that is both accurate and small. I am not keen on that kind of work: it is very complex and brings a lot of training difficulty. It is easy to compress a trained model into a small one, but very hard to train a small model directly.

———————— Method: Loss Function ————————
The method is in fact divided into two steps: first, a sparse, low-diversity network is trained with the loss function below; then the network is compressed using pruning and quantization.
As you can see, the loss contains three parts: \(L(\hat{y}, y)\) is the prediction error; \(\Vert W_j \Vert_p\) is the p-norm of the weight matrix, used to describe its density; \(\vert W_j(a,b) - W_j(a',b')\vert\), summed over pairs of entries, describes the diversity of the weight matrix. In addition, \(\lambda_j\) adjusts the weight of each layer's density and diversity losses, mainly to balance the differences in magnitude caused by each layer's parameter count.
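Putting these pieces together, the overall objective presumably looks something like the following (my reconstruction from the description above; the exact summation ranges and scaling may differ from the paper):

\[
\mathcal{L} \;=\; L(\hat{y}, y)
\;+\; \sum_{j} \lambda_j \Big(
\underbrace{\Vert W_j \Vert_p}_{\text{density}}
\;+\;
\underbrace{\sum_{a,b}\sum_{a',b'} \big\vert W_j(a,b) - W_j(a',b') \big\vert}_{\text{diversity}}
\Big)
\]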
For back-propagation, computing the gradient of the diversity loss \(\vert W_j(a,b) - W_j(a',b')\vert\) naively is far too expensive, so the author proposes a fast method; I will not repeat it here, interested readers can look at the original paper.
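To give a feel for why sorting helps with sums of pairwise absolute differences at all (my own illustration of a standard identity, not necessarily the paper's exact derivation), compare the naive O(n²) computation with an O(n log n) one:

```python
import numpy as np

def diversity_naive(w):
    # sum of |w_a - w_b| over all ordered pairs, O(n^2)
    return np.abs(w[:, None] - w[None, :]).sum()

def diversity_sorted(w):
    # same value via sorting, O(n log n):
    # 2 * sum_{i<j} (w_(j) - w_(i)) = 2 * sum_i w_(i) * (2i - n + 1)
    s = np.sort(w)
    n = s.size
    coeff = 2 * np.arange(n) - n + 1
    return 2 * (s * coeff).sum()

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
print(np.isclose(diversity_naive(w), diversity_sorted(w)))   # True

# a subgradient follows the same counting idea (ignoring ties):
# d/dw_i sum_{a,b} |w_a - w_b| = 2 * (#{w_j < w_i} - #{w_j > w_i})
rank = np.argsort(np.argsort(w))
subgrad = 2 * (2 * rank - (w.size - 1))
```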
For training, the author also notes that this loss can hurt model performance, so a step-by-step, alternating approach is taken (illustrated by a diagram in the paper): first train with the density and diversity losses added (accuracy will be rather poor at this stage), then remove these losses, keep the pattern of the fully-connected layers unchanged, and train the other layers; these two steps are then repeated.
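Below is a toy, runnable sketch of this alternating idea on random data (my own simplification, not the authors' code: hyperparameters and sizes are made up, only the sparsity part of the pattern is enforced in the recovery phase, and the penalty reuses the sorting identity from the previous snippet):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 20), torch.randint(0, 3, (512,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
task_loss = nn.CrossEntropyLoss()

def dd_penalty(weight, lam=1e-4):
    # density term: entrywise L1 norm (p = 1 chosen for simplicity)
    density = weight.abs().sum()
    # diversity term: sum of pairwise absolute differences via the sorting identity
    s, _ = torch.sort(weight.reshape(-1))
    n = s.numel()
    coeff = 2.0 * torch.arange(n, dtype=s.dtype) - n + 1
    diversity = 2.0 * (s * coeff).sum()
    return lam * (density + diversity)

def run(epochs, use_penalty, masks=None):
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(epochs):
        opt.zero_grad()
        loss = task_loss(model(X), y)
        if use_penalty:
            loss = loss + sum(dd_penalty(m.weight)
                              for m in model if isinstance(m, nn.Linear))
        loss.backward()
        opt.step()
        if masks is not None:            # keep pruned weights at zero while recovering
            with torch.no_grad():
                for m, mask in masks.items():
                    m.weight.mul_(mask)

for phase in range(3):
    # step 1: train with the density-diversity penalty
    # (accuracy drops, weights become sparse and collapse toward shared values)
    run(epochs=20, use_penalty=True)
    # record a sparsity pattern from the near-zero weights
    # (the weight-tying part of the pattern is omitted in this toy version)
    masks = {m: (m.weight.abs() > 1e-3).float()
             for m in model if isinstance(m, nn.Linear)}
    # step 2: drop the penalty, keep the pattern fixed, train to recover accuracy
    run(epochs=20, use_penalty=False, masks=masks)
```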