Deeplearning.ai summary: a C++ implementation of Adam optimization
Flyfish
Build environment
VC++ 2017
The theory is excerpted from "Deep Learning"
Adam is an adaptive learning rate optimization algorithm. The name "Adam" derives from the phrase "adaptive moments".
In the context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and momentum, with a few important distinctions.
First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients.
The use of momentum in combination with rescaling has no clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moment (the momentum term) and the (uncentered) second-order moment, to account for their initialization at the origin.
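The two points above can be written out as update rules. This is a sketch in the notation commonly used for Adam; the symbols are introduced here for illustration and do not appear in the excerpt: g_t is the gradient at step t, m_t and v_t are the first- and second-moment estimates (both assumed to start at zero), beta_1 and beta_2 are the decay rates, alpha is the learning rate, and epsilon is a small constant.

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t
      && \text{(momentum as a first-moment estimate)} \\
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{2}
      && \text{(uncentered second-moment estimate)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}
      && \text{(bias corrections)} \\
w_t &= w_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
      && \text{(parameter update)}
\end{aligned}
```

At t = 1 the corrections give \hat{m}_1 = g_1 and \hat{v}_1 = g_1^2, so the zero initialization no longer shrinks the first estimates.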
RMSProp also incorporates an estimate of the (uncentered) second-order moment, but it lacks the correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training.
Adam is generally regarded as fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
// Paper address: http://arxiv.org/abs/1412.6980
#include <cmath>
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

// Calls f(i) for every i in [0, size).
template <typename T, typename Func>
inline void for_i(T size, Func f) {
    for (size_t i = 0; i < size; ++i) { f(i); }
}

typedef std::vector<double> tensor2d;

class Adam {
public:
    Adam() : alpha(double(0.001)), b1(double(0.9)), b2(double(0.999)),
             b1_t(double(0.9)), b2_t(double(0.999)), eps(double(1e-8)) {}

    void update(const tensor2d &dW, tensor2d &W) {
        tensor2d &mt = get<0>(W);  // first-order moment state for W
        tensor2d &vt = get<1>(W);  // second-order moment state for W

        // Debug output: gradient, moment estimates, and the stored state.
        for (auto it = dW.begin(); it != dW.end(); it++) std::cout << *it << "\t";
        std::cout << "dW\n";
        for (auto it = mt.begin(); it != mt.end(); it++) std::cout << *it << "\t";
        std::cout << "mt\n";
        for (auto it = vt.begin(); it != vt.end(); it++) std::cout << *it << "\t";
        std::cout << "vt\n";
        for (const auto &n : E_) {  // the address of W is the hash key
            for (auto it = n.begin(); it != n.end(); it++) {
                std::cout << (*it).first << ":" << std::endl;
                for (auto s = (*it).second.begin(); s != (*it).second.end(); s++) {
                    std::cout << (*s);
                }
                std::cout << "complete\n";
            }
        }

        for_i(W.size(), [&](size_t i) {
            mt[i] = b1 * mt[i] + (double(1.0) - b1) * dW[i];          // momentum
            vt[i] = b2 * vt[i] + (double(1.0) - b2) * dW[i] * dW[i];  // RMSProp
            double mt_hat = mt[i] / (double(1.0) - b1_t);  // bias correction
            double vt_hat = vt[i] / (double(1.0) - b2_t);
            // L2 norm
            W[i] -= alpha * (mt_hat) / (std::sqrt(vt_hat) + eps);
        });
        b1_t *= b1;
        b2_t *= b2;
    }

    // Learning rate or step factor, which controls how much the weights are
    // updated (such as 0.001). Larger values (such as 0.3) give faster initial
    // learning before the learning rate is updated, and smaller values
    // (such as 1.0E-5) learn more slowly but can converge to better performance.
    double alpha;  // learning rate
    // Exponential decay rate of the first-order moment estimate (e.g. 0.9).
    double b1;
    // Exponential decay rate of the second-order moment estimate (e.g. 0.999).
    // This hyperparameter should be set close to 1 for problems with sparse
    // gradients, such as NLP or computer vision tasks.
    double b2;
    double b1_t;  // b1 raised to the power t (t = current step)
    double b2_t;  // b2 raised to the power t

private:
    // A very small number that prevents division by zero in the
    // implementation (such as 1e-8).
    double eps;

private:
    // Returns the state vector with the given index for this parameter
    // vector, creating a zero-filled one on first use.
    template <int index>
    tensor2d &get(const tensor2d &key) {
        if (E_[index][&key].empty()) E_[index][&key].resize(key.size(), double());
        return E_[index][&key];
    }
    std::unordered_map<const tensor2d *, tensor2d> E_[2];
};