ParseNet, and several of the other articles mentioned here, apply an L2 norm before feature fusion: the activation magnitudes of feature maps from different layers vary widely, and without normalization the feature maps with larger activations dominate the fused result. This post covers:

Brief Introduction
L2norm Formula
Backpropagation (BP) Derivation of L2norm
Caffe Implementation of L2norm
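For reference, the forward pass and its gradient can be sketched as follows. This is the standard per-vector L2 normalization derivation, not text from the layer itself; γ denotes the learned scale parameter and ε the layer's eps parameter:

```latex
\hat{x}_i = \frac{x_i}{\lVert x \rVert}, \qquad
\lVert x \rVert = \sqrt{\textstyle\sum_{j=1}^{d} x_j^2 + \epsilon}, \qquad
y_i = \gamma \,\hat{x}_i
```

and in the backward pass, writing $g_i = \partial L / \partial y_i$ for the top gradient:

```latex
\frac{\partial L}{\partial x_k}
  = \frac{\gamma}{\lVert x \rVert}
    \Bigl( g_k - \hat{x}_k \sum_{i=1}^{d} g_i \hat{x}_i \Bigr)
```

Here d is the number of summed elements: channel*height*width when across_spatial is true, and channel when it is false (see the parameter analysis below).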
L2norm source code is surprisingly hard to find online, and Caffe itself does not ship an implementation. A quick look through GitHub turns up many l2norm implementations that are incomplete, for example offering no choice between the across_spatial: false and across_spatial: true modes. After some searching it turns out that Wei Liu, the author of ParseNet, has in fact released the official L2norm source code on his own GitHub, but under a name and in a location that are not easy to find.
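As a concrete illustration, a Normalize layer is typically declared in a prototxt roughly like this (a sketch modeled on the SSD configs; the blob names conv4_3/conv4_3_norm and the scale value 20 are common SSD choices, not requirements):

```protobuf
layer {
  name: "conv4_3_norm"
  type: "Normalize"
  bottom: "conv4_3"
  top: "conv4_3_norm"
  norm_param {
    # normalize each spatial position over channels only
    across_spatial: false
    # initialize the learned per-channel scale to 20
    scale_filler {
      type: "constant"
      value: 20
    }
    channel_shared: false
  }
}
```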
Wei Liu's implementation of the L2norm source can be seen here. It is named normalize_layer, and the corresponding .cpp, .cu, and .hpp files can be found in the matching locations. In addition, since Wei Liu's SSD paper also uses L2norm, usage examples of normalize_layer can be found in the SSD prototxts, see here.

Source Analysis

Caffe.proto Analysis
message NormalizeParameter {
  optional bool across_spatial = 1 [default = true];
  // Initial value of scale. Default is 1.0 for all.
  optional FillerParameter scale_filler = 2;
  // Whether or not scale parameters are shared across channels.
  optional bool channel_shared = 3 [default = true];
  // Epsilon for not dividing by zero while normalizing variance.
  optional float eps = 4 [default = 1e-10];
}
There are two very important parameters here: across_spatial and channel_shared. across_spatial determines the scope of the normalization. If true (the default), each num is normalized as a whole, i.e. the sum of squares of x_i runs over all channel*height*width values. If false, normalization is not across spatial positions: the sum runs over the channel values only, so each pixel in the spatial plane (height*width) is normalized separately, which greatly narrows the scope of the normalization. channel_shared controls the scale that top_data is multiplied by after normalization (this scale is the only learned parameter of normalize_layer). If channel_shared is true (the default), every channel of top_data is multiplied by the same number; if false, each channel is multiplied by its own scale.
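The two modes can be sketched in NumPy as follows. This is a minimal sketch of the behavior described above, not the actual Caffe code; the function name and signature are mine:

```python
import numpy as np

def l2_normalize(x, scale, across_spatial=True, channel_shared=True, eps=1e-10):
    """L2-normalize a blob of shape (num, channel, height, width),
    mimicking the Normalize layer's behavior as described above.
    `scale` is a scalar when channel_shared, else a per-channel array."""
    n, c, h, w = x.shape
    if across_spatial:
        # one norm per num, summed over all channel*height*width values
        norm = np.sqrt((x ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    else:
        # one norm per spatial position, summed over channels only
        norm = np.sqrt((x ** 2).sum(axis=1, keepdims=True) + eps)
    y = x / norm
    if channel_shared:
        return y * scale  # same scale for every channel
    # a different learned scale per channel
    return y * np.asarray(scale).reshape(1, c, 1, 1)
```

For an all-ones input, across_spatial=False divides each value by sqrt(channel), while across_spatial=True divides by sqrt(channel*height*width), which shows how much larger the normalization scope is in the default mode.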
As for the other implementation details in the .cpp and .cu files, you can read them yourself, or see Reference 1 and Reference 2.

Use Summary
I recently spent two weeks trying Layer Norm, Instance Norm, L2norm, Local Response Norm, and Group Norm for feature fusion. To sum up, the results were not good: the fused result is very sensitive to how the scale layer after the norm is initialized, which makes it hard to tune, and my attempts were basically failed reproductions. It seems better to simply fuse the feature maps directly with an eltwise sum or concatenation, without any norm at all. If anyone has made this work, please leave a comment, or let's discuss it together.
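The two norm-free fusion options mentioned above can be sketched as follows (the shapes are illustrative; both feature maps are assumed to be resized to the same spatial size beforehand):

```python
import numpy as np

# Two feature maps from different layers, already brought to the same
# (num, channel, height, width) shape.
a = np.random.randn(1, 256, 38, 38)
b = np.random.randn(1, 256, 38, 38)

# Eltwise-sum fusion: channel counts must match, output shape is unchanged.
fused_sum = a + b

# Concatenation fusion: stack along the channel axis; channel counts add up.
fused_cat = np.concatenate([a, b], axis=1)
```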