Ternary Weight Networks


Introduction

I spent the last couple of days reading this paper and would like to share it here. Rather than writing yet another blog post identical to everyone else's, I will only record my own ideas (PS: when searching a topic, it gets annoying to find that every article says the same thing). To avoid repeating earlier work, I will point to a blog I happened upon, "Weight Simplification (1): Ternary Neural Networks (Ternary Weight Networks)", whose coverage of the paper's content and implementation is very comprehensive; it is worth reading, and I learned from it as well.
The main contributions of the paper fall into three areas:

    • Increases the network's expressive power: a scaling factor $\alpha$ is added on top of the ternary basis $\{-1, 0, +1\}$;
    • Compresses the model size, mainly through weight compression: a 16~32x reduction compared to FPWN (full-precision weight networks), though 2x the size of BPWN (binary-precision weight networks). (PS: in the TWN Caffe code the weights are all still stored as float/double, because the compression has to be realized on the hardware/deployment side.)
    • Reduces the computation required: mainly, compared to BPWN, the added 0 value allows multiplications to be skipped entirely. This too needs hardware support to pay off, and is not exploited in the Caffe code;
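To make the 16x compression claim concrete, here is a minimal sketch (my own illustration, not part of the paper or caffe-twns) of packing ternary weights into 2 bits each, four per byte, versus 32 bits for a float:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical packing scheme: encode -1 -> 0b10, 0 -> 0b00, +1 -> 0b01,
// four ternary weights per byte. A layer of n float weights (4n bytes)
// shrinks to n/4 bytes, i.e. the ~16x reduction over FPWN.
std::vector<uint8_t> pack_ternary(const std::vector<int8_t>& w) {
    std::vector<uint8_t> packed((w.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        uint8_t code = (w[i] == 1) ? 0b01 : (w[i] == -1) ? 0b10 : 0b00;
        packed[i / 4] |= code << (2 * (i % 4));  // 2 bits per weight
    }
    return packed;
}
```

For example, 1024 float weights occupy 4096 bytes, while the packed form takes 256 bytes; this is exactly the storage argument, and it is why the compression only materializes when deployment hardware/software actually uses such a packed format.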
Ternary quantization

In my understanding, the core of the paper is this: it takes an optimization problem over two constrained, mutually dependent variables, splits it apart step by step, and finally uses a prior statistical assumption to obtain an approximate solution.
The initial optimization problem is to find the scaling factor and ternary weights that best approximate the full-precision weights:

$$\alpha^*, W^{t*} = \mathop{\arg\min}_{\alpha, W^t} \|W - \alpha W^t\|_2^2 \quad \text{s.t.}\;\; \alpha \ge 0,\; w_i^t \in \{-1, 0, +1\},\; i = 1,\dots,n \tag{1}$$

The constraint on $w^{t}$ is materialized as a threshold-based ternary function:

$$w_i^t = f_t(w_i \mid \Delta) = \begin{cases} +1, & w_i > \Delta \\ 0, & |w_i| \le \Delta \\ -1, & w_i < -\Delta \end{cases} \tag{2}$$

Substituting this into formula (1), the optimization over $w^{t}$ is converted into an optimization over the threshold $\Delta$:

$$\alpha^*, \Delta^* = \mathop{\arg\min}_{\alpha \ge 0,\, \Delta > 0} \; |I_\Delta|\,\alpha^2 - 2\Big(\sum_{i \in I_\Delta} |w_i|\Big)\alpha + c_\Delta \tag{4}$$

where $I_\Delta = \{i \mid |w_i| > \Delta\}$, $|I_\Delta|$ is the number of its elements, and $c_\Delta = \sum_{i \in I_\Delta^c} w_i^2$ is a constant independent of $\alpha$.

Setting the partial derivative of formula (4) with respect to $\alpha$ to zero gives:

$$\alpha_\Delta^* = \frac{1}{|I_\Delta|}\sum_{i \in I_\Delta} |w_i| \tag{5}$$

Because $\alpha$ and $\Delta$ depend on each other, substituting (5) into (4) eliminates $\alpha$:

$$\Delta^* = \mathop{\arg\max}_{\Delta > 0} \frac{1}{|I_\Delta|}\Big(\sum_{i \in I_\Delta} |w_i|\Big)^2 \tag{6}$$

But here comes the problem: formula (6) still has no straightforward closed-form solution. Based on prior knowledge, the paper assumes the $w_i$ follow an $N(0, \sigma^2)$ distribution, under which $\Delta^*$ is approximately $0.6\sigma$ (and $0.6\sigma$ equals $0.75\,E(|w|)$). The authors therefore adopt the cruder rule of thumb $\Delta^* \approx 0.7\,E(|w|) \approx \frac{0.7}{n}\sum_{i=1}^{n} |w_i|$.
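Putting the pieces together, the whole closed-form approximation fits in a few lines. This is a standalone sketch of the math above (names are mine, not from caffe-twns): compute $\Delta \approx \frac{0.7}{n}\sum|w_i|$, ternarize by formula (2), then compute $\alpha$ by formula (5):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TernaryResult {
    double delta;          // threshold from the 0.7 * E(|w|) rule
    double alpha;          // scaling factor from formula (5)
    std::vector<int> wt;   // ternary weights in {-1, 0, +1}
};

TernaryResult ternarize(const std::vector<double>& w) {
    double sum_abs = 0;
    for (double x : w) sum_abs += std::fabs(x);
    double delta = 0.7 * sum_abs / w.size();   // Delta* ~ 0.7/n * sum |w_i|

    double num = 0;            // sum of |w_i| over I_delta
    std::size_t count = 0;     // |I_delta|
    std::vector<int> wt(w.size(), 0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        if (w[i] > delta)       { wt[i] = 1;  num += w[i]; ++count; }
        else if (w[i] < -delta) { wt[i] = -1; num -= w[i]; ++count; }
        // |w_i| <= delta stays 0
    }
    double alpha = count ? num / count : 0.0;  // formula (5)
    return {delta, alpha, wt};
}
```

For $w = (0.5, -0.4, 0.05, -0.05)$ this yields $\Delta = 0.175$, $w^t = (+1, -1, 0, 0)$ and $\alpha = 0.45$, which matches what the Caffe code below computes with dot products.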

// caffe-twns: blob.cpp
template <typename Dtype>
void Blob<Dtype>::set_delta() {
  float scale_factor = TERNARY_DELTA * 1.0 / 10;  // delta = 0.7
  Dtype delta = (Dtype) scale_factor * this->asum_data() / this->count();  // 0.7 * E(|w_i|)
  delta = (delta <= 100) ? delta : 100;    // clamp to [-100, 100]
  delta = (delta >= -100) ? delta : -100;
  this->delta_ = delta;
}

template <typename Dtype>
void Blob<Dtype>::set_delta(Dtype delta) {
  delta = (delta <= 100) ? delta : 100;
  delta = (delta >= -100) ? delta : -100;
  this->delta_ = delta;
}
Implementation

I have borrowed a figure here: the paper's training algorithm, with numbered steps.

Going step by step, the code corresponding to step 5 is as follows:

// blob.cpp
template <typename Dtype>
void Blob<Dtype>::ternarize_data(Phase phase) {
  if (phase == RUN) {
    // if (DEBUG) print_head();
    // LOG(INFO) << "RUN phase ...";
    // caffe_sleep(3);
    return;  // do nothing for the running phase
  } else if (phase == TRAIN) {
    // LOG(INFO) << "TRAIN phase ...";
    // caffe_sleep(3);
  } else {
    // LOG(INFO) << "TEST phase ...";
    // caffe_sleep(3);
  }

  // const Dtype delta = 0;  // default value;
  // const Dtype delta = (Dtype) 0.8 * this->asum_data() / this->count();
  this->set_delta();  // default 0.7 * E(|w_i|), or set by the user
  const Dtype delta = this->get_delta();
  Dtype alpha = 1;

  if (!data_) { return; }
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU:
    {
      caffe_cpu_ternary<Dtype>(this->count(), delta, this->cpu_data(),
          this->mutable_cpu_binary());  // quantize weights to ternary
      alpha = caffe_cpu_dot(this->count(), this->cpu_binary(),
          this->cpu_data());            // sum of |w_i| over i in I_delta
      alpha /= caffe_cpu_dot(this->count(), this->cpu_binary(),
          this->cpu_binary());          // divide by |I_delta|
      caffe_cpu_scale(this->count(), alpha, this->cpu_binary(),
          this->mutable_cpu_binary());
      // this->set_alpha(alpha);
    }
    return;
  case SyncedMemory::HEAD_AT_GPU:
  case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
    {
      caffe_gpu_ternary<Dtype>(this->count(), delta, this->gpu_data(),
          this->mutable_gpu_binary());
      Dtype* pa = new Dtype(0);
      caffe_gpu_dot(this->count(), this->gpu_binary(), this->gpu_data(), pa);
      Dtype* pb = new Dtype(0);
      caffe_gpu_dot(this->count(), this->gpu_binary(), this->gpu_binary(), pb);
      alpha = (*pa) / ((*pb) + 1e-6);
      this->set_alpha(alpha);
      caffe_gpu_scale(this->count(), alpha, this->gpu_binary(),
          this->mutable_gpu_binary());
      // this->set_alpha((Dtype) 1);
      // LOG(INFO) << "alpha = " << alpha;
      // caffe_sleep(3);
    }
    return;
#else
    NO_GPU;
#endif
  case SyncedMemory::UNINITIALIZED:
    return;
  default:
    LOG(FATAL) << "Unknown SyncedMemory head state: " << data_->head();
  }
}

Steps 6~7: for step 6 the author directly uses the traditional Caffe path in caffe-twns, whereas the factoring $z = XW \approx X(\alpha W^t) = (\alpha X) \oplus W^t$ is better suited to hardware-accelerated optimization (since in caffe-twns the ternary weights themselves are still stored as float or double and the computation is accelerated with BLAS or cuDNN, which cannot skip the 0 values directly):

// conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // const Dtype* weight = this->blobs_[0]->cpu_data();
  if (BINARY) {
    this->blobs_[0]->binarize_data();
  }
  if (TERNARY) {
    this->blobs_[0]->ternarize_data(this->phase_);
    // quantize blobs_[0] to ternary and store it in cpu_binary()
    /*
    Dtype alpha = (Dtype) this->blobs_[0]->get_alpha();
    for (int i = 0; i < bottom.size(); i++) {
      Blob<Dtype>* blob = bottom[i];
      caffe_cpu_scale(blob->count(), alpha, blob->cpu_data(),
          blob->mutable_cpu_data());
    }
    */
  }
  const Dtype* weight = (BINARY || TERNARY) ?
      this->blobs_[0]->cpu_binary() : this->blobs_[0]->cpu_data();
  ...
}
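To illustrate why hardware could exploit the factored form $(\alpha X) \oplus W^t$ while the BLAS path above cannot, here is a hypothetical multiply-free inner product (my own sketch, not in caffe-twns): zeros are skipped outright, $\pm 1$ become adds and subtracts, and $\alpha$ is applied once at the end:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// z = alpha * (sum_{w^t_i = +1} x_i  -  sum_{w^t_i = -1} x_i):
// no multiplications involve the weights at all.
double ternary_dot(const std::vector<double>& x,
                   const std::vector<int>& wt, double alpha) {
    double acc = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (wt[i] == 1)       acc += x[i];  // +1: add
        else if (wt[i] == -1) acc -= x[i];  // -1: subtract
        // 0: skipped entirely
    }
    return alpha * acc;
}
```

A dense BLAS GEMM over float-stored ternary weights performs every multiply-accumulate regardless; the savings from the 0 entries only appear in a sparse- or ternary-aware kernel like this one.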

Steps 11~19: the weight update is performed on the full-precision weights, while the gradients are computed with the ternary weights:

// conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
  for (int i = 0; i < top.size(); ++i) {
    ...
    if (this->param_propagate_down_[0] || propagate_down[i]) {
      for (int n = 0; n < this->num_; ++n) {
        // gradient w.r.t. weight. Note that we will accumulate diffs.
        if (this->param_propagate_down_[0]) {
          this->weight_cpu_gemm(bottom_data + n * this->bottom_dim_,
              top_diff + n * this->top_dim_, weight_diff);
        }
        // gradient w.r.t. bottom data, if necessary.
        if (propagate_down[i]) {
          this->backward_cpu_gemm(top_diff + n * this->top_dim_, weight,
              bottom_diff + n * this->bottom_dim_);
        }
      }
    }
  }
}
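The essential pattern of steps 11~19 can be sketched in a few lines (a minimal illustration under my reading of the algorithm, with names of my own choosing): the gradients flow through the ternarized weights, but SGD applies them to the persistent full-precision weights, which are re-ternarized on the next forward pass:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Update the full-precision weight store with a gradient that was
// computed against the ternary weights. The full-precision copy is what
// accumulates the small updates that would vanish in {-1, 0, +1}.
void sgd_update(std::vector<double>& w_full,      // full-precision weights
                const std::vector<double>& grad,  // computed with w^t
                double lr) {
    for (std::size_t i = 0; i < w_full.size(); ++i)
        w_full[i] -= lr * grad[i];
}
```

This is why `weight_diff` above accumulates into `blobs_[0]`'s diff while `cpu_binary()` is only a derived quantity: quantizing before the update instead would throw away any step smaller than the quantization gap.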
