Introduction
I spent the last two days reading this paper and want to share it here, but I prefer to record only the things other blogs do not cover, plus my own thoughts (PS: it gets tiresome when every blog post about a paper says exactly the same thing). To avoid repeating earlier work, I will simply point to a blog post I happened to come across, "Weight Simplification (1): Ternary Weight Networks", which covers both the paper and the implementation very thoroughly; it is well worth reading and I learned from it myself.
The paper's main contributions fall into three areas:
- Increased expressive power: a scaling factor $\alpha$ is introduced on top of the ternary values $\{-1, 0, +1\}$;
- Compressed model size, mainly by compressing the weights: roughly 16x~32x smaller than an FPWN (full-precision weight network), although about 2x the size of a BPWN (binary-precision weight network). (PS: in the TWN Caffe code the weights are still stored as float/double; the actual compression has to be realized on the deployment/hardware side.)
- Reduced computation: the gain mainly comes from the extra 0 state compared with a BPWN, since multiplications with zero weights can be skipped, but this too requires hardware support and is not exploited in the Caffe code;
Ternary quantization
As I understand it, the core of the paper is this: the optimization problem involves two constrained, mutually dependent variables, which the authors gradually decouple, and in the end they use a prior statistical assumption to obtain an approximate solution.
The initial optimization problem:

$$
\begin{cases}
\alpha^{*}, W^{t*} = \arg\min_{\alpha,\, W^{t}} J(\alpha, W^{t}) = \lVert W - \alpha W^{t} \rVert_{2}^{2} \\
\text{s.t.} \quad \alpha \ge 0,\; W^{t}_{i} \in \{-1, 0, +1\},\; i = 1, 2, \dots, n
\end{cases}
\tag{1}
$$

The constraint on $W^{t}$ is made concrete through a threshold-based ternary function:

$$
W^{t}_{i} = f_{t}(W_{i} \mid \Delta) =
\begin{cases}
+1, & W_{i} > \Delta \\
0, & |W_{i}| \le \Delta \\
-1, & W_{i} < -\Delta
\end{cases}
$$
Substituting this into formula (1), the optimization over $W^{t}$ turns into an optimization over $\Delta$ and $\alpha$:

$$
\alpha^{*}, \Delta^{*} = \arg\min_{\alpha \ge 0,\, \Delta > 0} \; |I_{\Delta}|\,\alpha^{2} - 2\Big(\sum_{i \in I_{\Delta}} |W_{i}|\Big)\alpha + c
\tag{4}
$$

where $I_{\Delta} = \{\, i : |W_{i}| > \Delta \,\}$, $|I_{\Delta}|$ is the number of its elements, and $c = \sum_{i=1}^{n} W_{i}^{2}$ is a constant independent of $\alpha$ and $\Delta$.
Taking the partial derivative of (4) with respect to $\alpha$ and setting it to zero gives:

$$
\alpha^{*}_{\Delta} = \frac{1}{|I_{\Delta}|} \sum_{i \in I_{\Delta}} |W_{i}|
\tag{5}
$$
Because $\alpha$ and $\Delta$ depend on each other, substituting (5) back into (4) eliminates $\alpha$:

$$
\Delta^{*} = \arg\max_{\Delta > 0} \; \frac{1}{|I_{\Delta}|} \Big( \sum_{i \in I_{\Delta}} |W_{i}| \Big)^{2}
\tag{6}
$$
The problem is that formula (6) still has no closed-form solution, so the paper falls back on prior knowledge: assuming the $W_{i}$ follow a $N(0, \sigma^{2})$ distribution, the approximate solution is $\Delta^{*} \approx 0.6\sigma$ (which equals $0.75\,E(|W|)$). The author therefore uses the blunt rule of thumb $\Delta^{*} \approx 0.7\,E(|W|) \approx \frac{0.7}{n}\sum_{i=1}^{n} |W_{i}|$, which is exactly what set_delta() below computes.
```cpp
// caffe-twns / blob.cpp
template <typename Dtype>
void Blob<Dtype>::set_delta() {
  float scale_factor = TERNARY_DELTA * 1.0 / 10;   // delta = 0.7 by default
  Dtype delta = (Dtype) scale_factor * this->asum_data() / this->count();  // 0.7 * sum(|w_i|) / num
  delta = (delta <= 100) ? delta : 100;
  delta = (delta >= -100) ? delta : -100;
  this->delta_ = delta;
}

template <typename Dtype>
void Blob<Dtype>::set_delta(Dtype delta) {
  delta = (delta <= 100) ? delta : 100;
  delta = (delta >= -100) ? delta : -100;
  this->delta_ = delta;
}
```
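To see the quantization end to end on concrete numbers, here is a minimal standalone sketch in plain C++ (not taken from caffe-twns; all the names are mine) that applies the threshold rule $\Delta \approx \frac{0.7}{n}\sum_i |W_i|$ and the closed-form $\alpha^*$ from formula (5) to a small weight vector:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Minimal TWN ternarization sketch for a flat weight vector.
// Illustrative only; not part of caffe-twns.
int main() {
  std::vector<float> w = {0.9f, -0.05f, 0.4f, -0.7f, 0.02f, -0.3f};

  // Delta ~= 0.7 * E(|W|)  (the paper's rule of thumb)
  float abs_sum = 0.f;
  for (float wi : w) abs_sum += std::fabs(wi);
  float delta = 0.7f * abs_sum / w.size();

  // Threshold-based ternary function f_t(W_i | Delta)
  std::vector<int> wt(w.size());
  float alpha_num = 0.f;
  int alpha_den = 0;
  for (size_t i = 0; i < w.size(); ++i) {
    if (w[i] > delta)       wt[i] = 1;
    else if (w[i] < -delta) wt[i] = -1;
    else                    wt[i] = 0;
    if (wt[i] != 0) {                // i belongs to I_Delta
      alpha_num += std::fabs(w[i]);  // sum of |W_i| over I_Delta
      alpha_den += 1;                // |I_Delta|
    }
  }
  // alpha* = (1/|I_Delta|) * sum_{i in I_Delta} |W_i|   -- formula (5)
  float alpha = alpha_den > 0 ? alpha_num / alpha_den : 0.f;

  printf("delta = %.4f, alpha = %.4f\n", delta, alpha);
  for (size_t i = 0; i < w.size(); ++i)
    printf("W[%zu] = %+.2f  ->  alpha * Wt = %+.4f\n", i, w[i], alpha * wt[i]);
  return 0;
}
```

Note that caffe-twns computes the same $\alpha$ with two dot products instead of an explicit loop, as shown in ternarize_data() below.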
Implementation
I have borrowed a figure here (the training algorithm from the paper) and will go through it step by step. The code for step 5, the weight ternarization, is as follows:
```cpp
// caffe-twns / blob.cpp
template <typename Dtype>
void Blob<Dtype>::ternarize_data(Phase phase) {
  if (phase == RUN) {
    // if(DEBUG) print_head();
    // LOG(INFO) << "RUN phase ...";
    return;  // do nothing for the running phase
  } else if (phase == TRAIN) {
    // LOG(INFO) << "TRAIN phase ...";
  } else {
    // LOG(INFO) << "TEST phase ...";
  }

  // const Dtype delta = 0; // default value;
  // const Dtype delta = (Dtype) 0.8 * this->asum_data() / this->count();
  this->set_delta();                     // default 0.7 * E(|w_i|), or set by the user
  const Dtype delta = this->get_delta();
  Dtype alpha = 1;

  if (!data_) { return; }
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU: {
    // quantize the weights to {-1, 0, +1} with threshold delta
    caffe_cpu_ternary<Dtype>(this->count(), delta, this->cpu_data(), this->mutable_cpu_binary());
    // alpha = sum_{i in I_delta} |w_i| / |I_delta|, computed with two dot products
    alpha = caffe_cpu_dot(this->count(), this->cpu_binary(), this->cpu_data());
    alpha /= caffe_cpu_dot(this->count(), this->cpu_binary(), this->cpu_binary());
    // store alpha * w^t in the binary buffer
    caffe_cpu_scale(this->count(), alpha, this->cpu_binary(), this->mutable_cpu_binary());
    // this->set_alpha(alpha);
  }
    return;
  case SyncedMemory::HEAD_AT_GPU:
  case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
  {
    caffe_gpu_ternary<Dtype>(this->count(), delta, this->gpu_data(), this->mutable_gpu_binary());
    Dtype* pa = new Dtype(0);
    caffe_gpu_dot(this->count(), this->gpu_binary(), this->gpu_data(), pa);
    Dtype* pb = new Dtype(0);
    caffe_gpu_dot(this->count(), this->gpu_binary(), this->gpu_binary(), pb);
    alpha = (*pa) / ((*pb) + 1e-6);
    this->set_alpha(alpha);
    caffe_gpu_scale(this->count(), alpha, this->gpu_binary(), this->mutable_gpu_binary());
    // this->set_alpha((Dtype)1);
    // LOG(INFO) << "alpha = " << alpha;
  }
    return;
#else
    NO_GPU;
#endif
  case SyncedMemory::UNINITIALIZED:
    return;
  default:
    LOG(FATAL) << "Unknown SyncedMemory head state: " << data_->head();
  }
}
```
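As a quick sanity check on the CPU branch above: because the quantized tensor satisfies $W^{t}_{i} W_{i} = |W_{i}|$ and $W^{t}_{i} W^{t}_{i} = 1$ for $i \in I_{\Delta}$, and $W^{t}_{i} = 0$ elsewhere, the ratio of the two dot products reproduces formula (5) exactly:

$$
\alpha = \frac{\langle W^{t}, W \rangle}{\langle W^{t}, W^{t} \rangle} = \frac{\sum_{i \in I_{\Delta}} |W_{i}|}{|I_{\Delta}|} = \alpha^{*}_{\Delta}
$$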
Steps 6~7: in caffe-twns the author's step 6 simply uses the standard Caffe forward computation, whereas $Z = XW \approx X(\alpha W^{t}) = (\alpha X) \oplus W^{t}$ is more of a hardware-oriented optimization (in caffe-twns the ternary weights are still stored as float or double and the computation goes through BLAS or cuDNN, so the 0 values cannot be skipped directly):
```cpp
// caffe-twns / conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // const Dtype* weight = this->blobs_[0]->cpu_data();
  if (BINARY) {
    this->blobs_[0]->binarize_data();
  }
  if (TERNARY) {
    // quantize blobs_[0] to ternary; the result (alpha * w^t) is stored in cpu_binary()
    this->blobs_[0]->ternarize_data(this->phase_);
    /*
    Dtype alpha = (Dtype) this->blobs_[0]->get_alpha();
    for (int i = 0; i < bottom.size(); i++) {
      Blob<Dtype>* blob = bottom[i];
      caffe_cpu_scale(blob->count(), alpha, blob->cpu_data(), blob->mutable_cpu_data());
    }
    */
  }
  const Dtype* weight = (BINARY || TERNARY) ?
      this->blobs_[0]->cpu_binary() : this->blobs_[0]->cpu_data();
  ...
}
```
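For intuition about what a dedicated kernel or hardware could gain here, this is a purely illustrative sketch (nothing like it exists in caffe-twns): a ternary inner product can skip the zero weights entirely, turn the $\pm 1$ weights into additions and subtractions, and multiply by $\alpha$ only once per output:

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative only: how z = (alpha * x) (+) w^t could be computed
// when w^t is stored as {-1, 0, +1}. Not part of caffe-twns.
float ternary_dot(const float* x, const int8_t* wt, std::size_t n, float alpha) {
  float acc = 0.f;
  for (std::size_t i = 0; i < n; ++i) {
    if (wt[i] == 0) continue;             // zero weights cost nothing
    acc += (wt[i] > 0) ? x[i] : -x[i];    // only additions / subtractions
  }
  return alpha * acc;                     // one multiplication per output
}
```

With BLAS/cuDNN the ternarized weights are treated as ordinary dense floats, which is why caffe-twns reproduces the accuracy behaviour of TWN but not the potential speed-up.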
Steps 11~19: the weight update is applied to the full-precision weights, while the gradient is computed with the ternary weights:
```cpp
// caffe-twns / conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
  for (int i = 0; i < top.size(); ++i) {
    ...
    if (this->param_propagate_down_[0] || propagate_down[i]) {
      for (int n = 0; n < this->num_; ++n) {
        // gradient w.r.t. weight. Note that we will accumulate diffs.
        if (this->param_propagate_down_[0]) {
          this->weight_cpu_gemm(bottom_data + n * this->bottom_dim_,
              top_diff + n * this->top_dim_, weight_diff);
        }
        // gradient w.r.t. bottom data, if necessary.
        if (propagate_down[i]) {
          this->backward_cpu_gemm(top_diff + n * this->top_dim_, weight,
              bottom_diff + n * this->bottom_dim_);
        }
      }
    }
  }
}
```
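To make the "update in full precision, compute gradients with ternary weights" split explicit, here is a hedged sketch of the pattern (my own simplified structure, not the caffe-twns solver code): the diff is accumulated during a forward/backward pass that used the ternarized copy, but the SGD step touches only the full-precision master weights, which are re-ternarized at the next iteration:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the TWN update pattern (not the caffe-twns solver code):
// forward/backward use the ternarized copy, the SGD step modifies the
// full-precision master weights, which get re-ternarized next iteration.
struct TernaryParam {
  std::vector<float> w_full;     // full-precision master weights (what SGD updates)
  std::vector<float> w_ternary;  // alpha * {-1, 0, +1}, rebuilt every iteration
  std::vector<float> diff;       // gradient accumulated against w_ternary
};

void sgd_step(TernaryParam& p, float lr) {
  for (std::size_t i = 0; i < p.w_full.size(); ++i) {
    p.w_full[i] -= lr * p.diff[i];  // update the full-precision copy, not the ternary one
    p.diff[i] = 0.f;                // clear the accumulated gradient
  }
  // next iteration: w_ternary is recomputed from w_full via ternarize_data()
}
```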