A simple proof of the AdaBoost algorithm
In this paper, a simple and elementary proof of the AdaBoost algorithm in machine learning is given; the only mathematical tool needed is elementary calculus (Calculus I).
AdaBoost is a powerful algorithm for building predictive models. A major disadvantage, however, is that AdaBoost may over-fit in the presence of noise. Freund, Y. & Schapire, R. E. (1997) proved that the training error of the ensemble is bounded by the following expression: \begin{equation}\label{ada1} e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot (1-\epsilon_t)} \end{equation} where $\epsilon_t$ is the error rate of each base classifier $t$. If the error rate is less than $0.5$, we can write $\epsilon_t=0.5-\gamma_t$, where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). The bound on the training error of the ensemble then becomes \begin{equation}\label{ada2} e_{ensemble}\le \prod_{t}\sqrt{1-4{\gamma_t}^2}\le e^{-2\sum_{t}{\gamma_t}^2} \end{equation} Thus, if each base classifier is slightly better than random, so that $\gamma_t>\gamma$ for some $\gamma>0$, then the training error drops exponentially fast. Nevertheless, because of its tendency to focus on misclassified training examples, the AdaBoost algorithm can be quite susceptible to over-fitting. We will give a new, simple proof of \ref{ada1} and \ref{ada2}; additionally, we try to explain the meaning of the parameter $\alpha_t=\frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}$ in the boosting algorithm.
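For a rough sense of how fast these bounds decay, here is a small numerical illustration (added for this write-up, not part of the original argument), assuming a constant edge $\gamma_t=\gamma=0.1$ in every round; it evaluates both sides of \ref{ada2} for a few ensemble sizes $T$.
\begin{verbatim}
import math

# Evaluate the two training-error bounds above, assuming (hypothetically)
# a constant edge gamma_t = gamma = 0.1 for all T rounds.
gamma = 0.1                      # each weak learner is 10% better than random
for T in (10, 50, 100):
    eps = 0.5 - gamma            # per-round error rate
    bound1 = (2 * math.sqrt(eps * (1 - eps))) ** T   # prod_t 2*sqrt(eps_t(1-eps_t))
    bound2 = math.exp(-2 * T * gamma ** 2)           # exp(-2 sum_t gamma_t^2)
    print(f"T={T:3d}  bound1={bound1:.4f}  bound2={bound2:.4f}")
\end{verbatim}
With $T=100$ rounds the product bound is already about $0.13$, and the exponential bound $e^{-2}\approx 0.135$ sits just above it, as the inequality requires.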
AdaBoost algorithm:
Recall the boosting algorithm: given $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$, where $x_i\in X$ and $y_i\in Y=\{-1, +1\}$.
Initialize $D_1(i)=\frac{1}{m}$. For $t=1, 2, \ldots, T$: train a weak learner using distribution $D_t$. Get a weak hypothesis $h_t: X\rightarrow \{-1, +1\}$ with error \[\epsilon_t=\Pr_{i\sim D_t}[h_t(x_i)\ne y_i].\] If $\epsilon_t>0.5$, then the weights $D_t(i)$ are reverted back to their original uniform values $\frac{1}{m}$.
Choose \begin{equation}\label{boost3} \alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t} \end{equation}
Update: \begin{equation}\label{boost4} D_{t+1}(i)=\frac{D_{t}(i)}{Z_t}\times \left\{\begin{array}{cc} e^{-\alpha_t} & \quad \textrm{if $h_t(x_i)=y_i$}\\ e^{\alpha_t} & \quad \textrm{if $h_t(x_i)\ne y_i$} \end{array} \right. \end{equation} where $Z_t$ is a normalization factor.
Output the final hypothesis \[H(x)=\mathrm{sign}\left(\sum_{t=1}^{T}\alpha_t\cdot h_t(x)\right).\]
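To make the steps above concrete, the following Python sketch implements the loop under two assumptions not fixed by the algorithm itself: the weak learner is a decision stump chosen by exhaustive search, and the toy data at the bottom is invented for illustration. The names train_stump and adaboost are likewise only illustrative.
\begin{verbatim}
import numpy as np

def train_stump(X, y, D):
    # Weak learner (an assumption for this sketch): pick the feature,
    # threshold and sign that minimize the weighted error under D.
    m, n = X.shape
    best = None
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(D[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    def h(Z):
        return np.where(Z[:, j] <= thr, sign, -sign)
    return err, h

def adaboost(X, y, T=20):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)          # D_1(i) = 1/m
    alphas, stumps = [], []
    for _ in range(T):
        eps, h = train_stump(X, y, D)
        if eps > 0.5:                # revert to uniform weights, as in the text
            D = np.full(m, 1.0 / m)
            continue
        eps = max(eps, 1e-12)        # guard against log(0) if the stump is perfect
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t from (boost3)
        pred = h(X)
        D = D * np.exp(-alpha * y * pred)       # update rule (boost4)
        D /= D.sum()                            # divide by the normalization factor Z_t
        alphas.append(alpha)
        stumps.append(h)
    def H(Z):                        # final hypothesis H(x) = sign(sum_t alpha_t h_t(x))
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, stumps)))
    return H

# toy usage on an invented 1-D dataset
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
H = adaboost(X, y, T=10)
print("training error:", np.mean(H(X) != y))
\end{verbatim}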
Proof: First, we prove \ref{ada1}. Note that $D_{t+1}(i)$ is a distribution and its summation $\sum_{i}D_{t+1}(i)$ equals 1, hence \[Z_t=\sum_{i}D_{t+1}(i)\cdot Z_t=\sum_{i}D_t(i)\times \left\{\begin{array}{cc} e^{-\alpha_t} & \quad \textrm{if $h_t(x_i)=y_i$}\\ e^{\alpha_t} & \quad \textrm{if $h_t(x_i)\ne y_i$} \end{array} \right.\] \[=\sum_{i:\ h_t(x_i)=y_i}D_t(i)\cdot e^{-\alpha_t}+\sum_{i:\ h_t(x_i)\ne y_i}D_t(i)\cdot e^{\alpha_t}\] \[=e^{-\alpha_t}\cdot \sum_{i:\ h_t(x_i)=y_i}D_t(i)+e^{\alpha_t}\cdot \sum_{i:\ h_t(x_i)\ne y_i}D_t(i)\] \begin{equation}\label{boost5} =e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t \end{equation}
In order to find $\alpha_t$, we minimize $Z_t$ by setting its first-order derivative equal to 0: \[{[e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t]}^{'}=-e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=0\] \[\rightarrow \alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t},\] which is \ref{boost3} in the boosting algorithm. Substituting this $\alpha_t$ into \ref{boost5} gives \[Z_t=e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=e^{-\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot (1-\epsilon_t)+e^{\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot\epsilon_t\] \begin{equation}\label{boost6} =2\sqrt{\epsilon_t\cdot (1-\epsilon_t)} \end{equation}
On the other hand, from \ref{boost4} we have \[D_{t+1}(i)=\frac{D_t(i)\cdot e^{-\alpha_t\cdot y_i\cdot h_t(x_i)}}{Z_t}=\frac{D_t(i)\cdot e^{k_t}}{Z_t},\] since the product $y_i\cdot h_t(x_i)$ is either $1$ if $h_t(x_i)=y_i$ or $-1$ if $h_t(x_i)\ne y_i$; here we abbreviate $k_t=-\alpha_t\cdot y_i\cdot h_t(x_i)$. Thus we can write down all of the equations \[D_1(i)=\frac{1}{m}\] \[D_2(i)=\frac{D_1(i)\cdot e^{k_1}}{Z_1}\] \[D_3(i)=\frac{D_2(i)\cdot e^{k_2}}{Z_2}\] \[\ldots\ldots\ldots\] \[D_{T+1}(i)=\frac{D_T(i)\cdot e^{k_T}}{Z_T}\] Multiplying all the equalities above, we obtain \[D_{T+1}(i)=\frac{1}{m}\cdot\frac{e^{-y_i\cdot f(x_i)}}{\prod_{t}Z_t},\] where $f(x_i)=\sum_{t}\alpha_t\cdot h_t(x_i)$. Thus \begin{equation}\label{boost7} \frac{1}{m}\cdot \sum_{i}e^{-y_i\cdot f(x_i)}=\sum_{i}D_{T+1}(i)\cdot\prod_{t}Z_t=\prod_{t}Z_t \end{equation}
Note that if $\epsilon_t>0.5$, the data set would be re-sampled (equivalently, the weights reverted to uniform, as in the algorithm above) until $\epsilon_t\le0.5$; in other words, the parameter $\alpha_t\ge0$ in each valid iteration. The training error of the ensemble can be expressed as \[e_{ensemble}=\frac{1}{m}\cdot\sum_{i}\left\{\begin{array}{cc} 1 & \quad \textrm{if $y_i\ne H(x_i)$}\\ 0 & \quad \textrm{if $y_i=H(x_i)$} \end{array} \right. =\frac{1}{m}\cdot \sum_{i}\left\{\begin{array}{cc} 1 & \quad \textrm{if $y_i\cdot f(x_i)\le0$}\\ 0 & \quad \textrm{if $y_i\cdot f(x_i)>0$} \end{array} \right.\] \begin{equation}\label{boost8} \le\frac{1}{m}\cdot\sum_{i}e^{-y_i\cdot f(x_i)}=\prod_{t}Z_t \end{equation}
The last step follows from \ref{boost7}. Combining \ref{boost6} and \ref{boost8}, we have proved \ref{ada1}: \begin{equation}\label{boost9} e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot (1-\epsilon_t)} \end{equation}
In order to prove \ref{ada2}, we first prove the following inequality: \begin{equation}\label{boost10} 1+x\le e^x, \end{equation} or equivalently, $e^x-x-1\ge0$. Let $g(x)=e^x-x-1$; then \[g^{'}(x)=e^x-1=0\rightarrow x=0.\] Since $g^{''}(x)=e^x>0$, \[{g(x)}_{min}=g(0)=0\Rightarrow e^x-x-1\ge0,\] which is the desired inequality.
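The closed form for $\alpha_t$ and the value of $Z_t$ at the minimum can also be double-checked numerically. The short sketch below (an added illustration, with arbitrary example values of $\epsilon_t$) evaluates $Z_t(\alpha)=e^{-\alpha}(1-\epsilon_t)+e^{\alpha}\epsilon_t$ on a grid and compares the grid minimizer with $\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ and the grid minimum with $2\sqrt{\epsilon_t(1-\epsilon_t)}$.
\begin{verbatim}
import numpy as np

# Check that alpha_t = 0.5*log((1-eps)/eps) minimizes
# Z(alpha) = e^{-alpha}*(1-eps) + e^{alpha}*eps, and that the
# minimum equals 2*sqrt(eps*(1-eps)); eps values are arbitrary examples.
for eps in (0.1, 0.25, 0.4):
    alphas = np.linspace(-3, 3, 200001)
    Z = np.exp(-alphas) * (1 - eps) + np.exp(alphas) * eps
    k = Z.argmin()
    alpha_closed = 0.5 * np.log((1 - eps) / eps)
    print(f"eps={eps:.2f}  grid argmin={alphas[k]:.4f}  closed form={alpha_closed:.4f}  "
          f"grid min={Z[k]:.4f}  2*sqrt(eps*(1-eps))={2*np.sqrt(eps*(1-eps)):.4f}")
\end{verbatim}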
Now we go back to \ref{boost9} and let \[\epsilon_t=\frac{1}{2}-\gamma_t,\] where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). Based on \ref{boost10}, we have \[e_{ensemble}\le\prod_{t}2\cdot\sqrt{\epsilon_t\cdot (1-\epsilon_t)}\] \[=\prod_{t}\sqrt{1-4\gamma_t^2}\] \[=\prod_{t}[1+(-4\gamma_t^2)]^{\frac{1}{2}}\] \[\le\prod_{t}(e^{-4\gamma_t^2})^{\frac{1}{2}}=\prod_{t}e^{-2\gamma_t^2}\] \[=e^{-2\cdot\sum_{t}\gamma_t^2},\] as desired.
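As a final sanity check on this chain of inequalities (again an added illustration, with randomly drawn example edges $\gamma_t$), the snippet below confirms that $\prod_t\sqrt{1-4\gamma_t^2}\le e^{-2\sum_t\gamma_t^2}$.
\begin{verbatim}
import numpy as np

# Verify prod_t sqrt(1 - 4*gamma_t^2) <= exp(-2*sum_t gamma_t^2)
# for randomly drawn edges gamma_t in (0, 0.5); purely illustrative.
rng = np.random.default_rng(0)
for trial in range(5):
    gammas = rng.uniform(0.0, 0.5, size=20)
    lhs = np.prod(np.sqrt(1 - 4 * gammas ** 2))
    rhs = np.exp(-2 * np.sum(gammas ** 2))
    print(f"trial {trial}: prod bound = {lhs:.6f}  exp bound = {rhs:.6f}  lhs <= rhs: {lhs <= rhs}")
\end{verbatim}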