Ternary Weight Networks


Introduction

I spent the last couple of days reading this paper and would like to share it here. Rather than writing yet another blog post identical to everyone else's, I will only record my own ideas (PS: when searching a topic, it gets annoying to find that every article says the same thing). To avoid repeating earlier work, I will point to a blog I happened upon, "Weight Simplification (1): Ternary Neural Networks (Ternary Weight Networks)", whose coverage of the paper's content and implementation is very comprehensive; it is worth reading, and I learned from it as well.
The main contributions of the paper fall into three areas:

    • Increases the network's expressive power: a scaling factor $\alpha$ is added on top of the ternary basis $\{-1, 0, +1\}$;
    • Compresses the model size, mainly through weight compression: a 16~32x reduction compared to FPWN (full-precision weight networks), though 2x the size of BPWN (binary-precision weight networks). (PS: in the TWN Caffe code the weights are all still stored as float/double, because the compression has to be realized on the hardware/deployment side.)
    • Reduces the computation required: mainly, compared to BPWN, the added 0 value allows multiplications to be skipped entirely. This too needs hardware support to pay off, and is not exploited in the Caffe code;
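To make the 16x compression claim concrete, here is a minimal sketch (my own illustration, not part of the paper or caffe-twns) of packing ternary weights into 2 bits each, four per byte, versus 32 bits for a float:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical packing scheme: encode -1 -> 0b10, 0 -> 0b00, +1 -> 0b01,
// four ternary weights per byte. A layer of n float weights (4n bytes)
// shrinks to n/4 bytes, i.e. the ~16x reduction over FPWN.
std::vector<uint8_t> pack_ternary(const std::vector<int8_t>& w) {
    std::vector<uint8_t> packed((w.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        uint8_t code = (w[i] == 1) ? 0b01 : (w[i] == -1) ? 0b10 : 0b00;
        packed[i / 4] |= code << (2 * (i % 4));  // 2 bits per weight
    }
    return packed;
}
```

For example, 1024 float weights occupy 4096 bytes, while the packed form takes 256 bytes; this is exactly the storage argument, and it is why the compression only materializes when deployment hardware/software actually uses such a packed format.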
Ternary quantization

In my understanding, the core of the paper is this: it takes an optimization problem over two constrained, mutually dependent variables, splits it apart step by step, and finally uses a prior statistical assumption to obtain an approximate solution.
The initial optimization problem is to find the scaling factor and ternary weights that best approximate the full-precision weights:

$$\alpha^*, W^{t*} = \mathop{\arg\min}_{\alpha, W^t} \|W - \alpha W^t\|_2^2 \quad \text{s.t.}\;\; \alpha \ge 0,\; w_i^t \in \{-1, 0, +1\},\; i = 1,\dots,n \tag{1}$$

The constraint on $w^{t}$ is materialized as a threshold-based ternary function:

$$w_i^t = f_t(w_i \mid \Delta) = \begin{cases} +1, & w_i > \Delta \\ 0, & |w_i| \le \Delta \\ -1, & w_i < -\Delta \end{cases} \tag{2}$$

Substituting this into formula (1), the optimization over $w^{t}$ is converted into an optimization over the threshold $\Delta$:

$$\alpha^*, \Delta^* = \mathop{\arg\min}_{\alpha \ge 0,\, \Delta > 0} \; |I_\Delta|\,\alpha^2 - 2\Big(\sum_{i \in I_\Delta} |w_i|\Big)\alpha + c_\Delta \tag{4}$$

where $I_\Delta = \{i \mid |w_i| > \Delta\}$, $|I_\Delta|$ is the number of its elements, and $c_\Delta = \sum_{i \in I_\Delta^c} w_i^2$ is a constant independent of $\alpha$.

Setting the partial derivative of formula (4) with respect to $\alpha$ to zero gives:

$$\alpha_\Delta^* = \frac{1}{|I_\Delta|}\sum_{i \in I_\Delta} |w_i| \tag{5}$$

Because $\alpha$ and $\Delta$ depend on each other, substituting (5) into (4) eliminates $\alpha$:

$$\Delta^* = \mathop{\arg\max}_{\Delta > 0} \frac{1}{|I_\Delta|}\Big(\sum_{i \in I_\Delta} |w_i|\Big)^2 \tag{6}$$

But here comes the problem: formula (6) still has no straightforward closed-form solution. Based on prior knowledge, the paper assumes the $w_i$ follow an $N(0, \sigma^2)$ distribution, under which $\Delta^*$ is approximately $0.6\sigma$ (and $0.6\sigma$ equals $0.75\,E(|w|)$). The authors therefore adopt the cruder rule of thumb $\Delta^* \approx 0.7\,E(|w|) \approx \frac{0.7}{n}\sum_{i=1}^{n} |w_i|$.
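Putting the pieces together, the whole closed-form approximation fits in a few lines. This is a standalone sketch of the math above (names are mine, not from caffe-twns): compute $\Delta \approx \frac{0.7}{n}\sum|w_i|$, ternarize by formula (2), then compute $\alpha$ by formula (5):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TernaryResult {
    double delta;          // threshold from the 0.7 * E(|w|) rule
    double alpha;          // scaling factor from formula (5)
    std::vector<int> wt;   // ternary weights in {-1, 0, +1}
};

TernaryResult ternarize(const std::vector<double>& w) {
    double sum_abs = 0;
    for (double x : w) sum_abs += std::fabs(x);
    double delta = 0.7 * sum_abs / w.size();   // Delta* ~ 0.7/n * sum |w_i|

    double num = 0;            // sum of |w_i| over I_delta
    std::size_t count = 0;     // |I_delta|
    std::vector<int> wt(w.size(), 0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        if (w[i] > delta)       { wt[i] = 1;  num += w[i]; ++count; }
        else if (w[i] < -delta) { wt[i] = -1; num -= w[i]; ++count; }
        // |w_i| <= delta stays 0
    }
    double alpha = count ? num / count : 0.0;  // formula (5)
    return {delta, alpha, wt};
}
```

For $w = (0.5, -0.4, 0.05, -0.05)$ this yields $\Delta = 0.175$, $w^t = (+1, -1, 0, 0)$ and $\alpha = 0.45$, which matches what the Caffe code below computes with dot products.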

// caffe-twns: blob.cpp
template <typename Dtype>
void Blob<Dtype>::set_delta() {
  float scale_factor = TERNARY_DELTA * 1.0 / 10;  // delta = 0.7
  Dtype delta = (Dtype) scale_factor * this->asum_data() / this->count();  // 0.7 * E(|w_i|)
  delta = (delta <= 100) ? delta : 100;    // clamp to [-100, 100]
  delta = (delta >= -100) ? delta : -100;
  this->delta_ = delta;
}

template <typename Dtype>
void Blob<Dtype>::set_delta(Dtype delta) {
  delta = (delta <= 100) ? delta : 100;
  delta = (delta >= -100) ? delta : -100;
  this->delta_ = delta;
}
Implementation

I have borrowed a figure here: the paper's training algorithm, with numbered steps.

Going step by step, the code corresponding to step 5 is as follows:

// blob.cpp
template <typename Dtype>
void Blob<Dtype>::ternarize_data(Phase phase) {
  if (phase == RUN) {
    // if (DEBUG) print_head();
    // LOG(INFO) << "RUN phase ...";
    // caffe_sleep(3);
    return;  // do nothing for the running phase
  } else if (phase == TRAIN) {
    // LOG(INFO) << "TRAIN phase ...";
    // caffe_sleep(3);
  } else {
    // LOG(INFO) << "TEST phase ...";
    // caffe_sleep(3);
  }

  // const Dtype delta = 0;  // default value;
  // const Dtype delta = (Dtype) 0.8 * this->asum_data() / this->count();
  this->set_delta();  // default 0.7 * E(|w_i|), or set by the user
  const Dtype delta = this->get_delta();
  Dtype alpha = 1;

  if (!data_) { return; }
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU:
    {
      caffe_cpu_ternary<Dtype>(this->count(), delta, this->cpu_data(),
          this->mutable_cpu_binary());  // quantize weights to ternary
      alpha = caffe_cpu_dot(this->count(), this->cpu_binary(),
          this->cpu_data());            // sum of |w_i| over i in I_delta
      alpha /= caffe_cpu_dot(this->count(), this->cpu_binary(),
          this->cpu_binary());          // divide by |I_delta|
      caffe_cpu_scale(this->count(), alpha, this->cpu_binary(),
          this->mutable_cpu_binary());
      // this->set_alpha(alpha);
    }
    return;
  case SyncedMemory::HEAD_AT_GPU:
  case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
    {
      caffe_gpu_ternary<Dtype>(this->count(), delta, this->gpu_data(),
          this->mutable_gpu_binary());
      Dtype* pa = new Dtype(0);
      caffe_gpu_dot(this->count(), this->gpu_binary(), this->gpu_data(), pa);
      Dtype* pb = new Dtype(0);
      caffe_gpu_dot(this->count(), this->gpu_binary(), this->gpu_binary(), pb);
      alpha = (*pa) / ((*pb) + 1e-6);
      this->set_alpha(alpha);
      caffe_gpu_scale(this->count(), alpha, this->gpu_binary(),
          this->mutable_gpu_binary());
      // this->set_alpha((Dtype) 1);
      // LOG(INFO) << "alpha = " << alpha;
      // caffe_sleep(3);
    }
    return;
#else
    NO_GPU;
#endif
  case SyncedMemory::UNINITIALIZED:
    return;
  default:
    LOG(FATAL) << "Unknown SyncedMemory head state: " << data_->head();
  }
}

Steps 6~7: for step 6 the author directly uses the traditional Caffe path in caffe-twns, whereas the factoring $z = XW \approx X(\alpha W^t) = (\alpha X) \oplus W^t$ is better suited to hardware-accelerated optimization (since in caffe-twns the ternary weights themselves are still stored as float or double and the computation is accelerated with BLAS or cuDNN, which cannot skip the 0 values directly):

// conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // const Dtype* weight = this->blobs_[0]->cpu_data();
  if (BINARY) {
    this->blobs_[0]->binarize_data();
  }
  if (TERNARY) {
    this->blobs_[0]->ternarize_data(this->phase_);
    // quantize blobs_[0] to ternary and store it in cpu_binary()
    /*
    Dtype alpha = (Dtype) this->blobs_[0]->get_alpha();
    for (int i = 0; i < bottom.size(); i++) {
      Blob<Dtype>* blob = bottom[i];
      caffe_cpu_scale(blob->count(), alpha, blob->cpu_data(),
          blob->mutable_cpu_data());
    }
    */
  }
  const Dtype* weight = (BINARY || TERNARY) ?
      this->blobs_[0]->cpu_binary() : this->blobs_[0]->cpu_data();
  ...
}
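To illustrate why hardware could exploit the factored form $(\alpha X) \oplus W^t$ while the BLAS path above cannot, here is a hypothetical multiply-free inner product (my own sketch, not in caffe-twns): zeros are skipped outright, $\pm 1$ become adds and subtracts, and $\alpha$ is applied once at the end:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// z = alpha * (sum_{w^t_i = +1} x_i  -  sum_{w^t_i = -1} x_i):
// no multiplications involve the weights at all.
double ternary_dot(const std::vector<double>& x,
                   const std::vector<int>& wt, double alpha) {
    double acc = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (wt[i] == 1)       acc += x[i];  // +1: add
        else if (wt[i] == -1) acc -= x[i];  // -1: subtract
        // 0: skipped entirely
    }
    return alpha * acc;
}
```

A dense BLAS GEMM over float-stored ternary weights performs every multiply-accumulate regardless; the savings from the 0 entries only appear in a sparse- or ternary-aware kernel like this one.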

Steps 11~19: the weight update is performed on the full-precision weights, while the gradients are computed with the ternary weights:

// conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
  for (int i = 0; i < top.size(); ++i) {
    ...
    if (this->param_propagate_down_[0] || propagate_down[i]) {
      for (int n = 0; n < this->num_; ++n) {
        // gradient w.r.t. weight. Note that we will accumulate diffs.
        if (this->param_propagate_down_[0]) {
          this->weight_cpu_gemm(bottom_data + n * this->bottom_dim_,
              top_diff + n * this->top_dim_, weight_diff);
        }
        // gradient w.r.t. bottom data, if necessary.
        if (propagate_down[i]) {
          this->backward_cpu_gemm(top_diff + n * this->top_dim_, weight,
              bottom_diff + n * this->bottom_dim_);
        }
      }
    }
  }
}
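The essential pattern of steps 11~19 can be sketched in a few lines (a minimal illustration under my reading of the algorithm, with names of my own choosing): the gradients flow through the ternarized weights, but SGD applies them to the persistent full-precision weights, which are re-ternarized on the next forward pass:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Update the full-precision weight store with a gradient that was
// computed against the ternary weights. The full-precision copy is what
// accumulates the small updates that would vanish in {-1, 0, +1}.
void sgd_update(std::vector<double>& w_full,      // full-precision weights
                const std::vector<double>& grad,  // computed with w^t
                double lr) {
    for (std::size_t i = 0; i < w_full.size(); ++i)
        w_full[i] -= lr * grad[i];
}
```

This is why `weight_diff` above accumulates into `blobs_[0]`'s diff while `cpu_binary()` is only a derived quantity: quantizing before the update instead would throw away any step smaller than the quantization gap.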
