Machine learning (using AdaBoost meta-algorithm to improve classification performance)


The idea behind a meta-algorithm is to combine other algorithms rather than rely on a single one. The listing below is a complete decision-stump AdaBoost implementation, together with a ROC-plotting helper:

from numpy import *


def loadSimpData():
    datMat = matrix([[1. , 2.1],
                     [2. , 1.1],
                     [1.3, 1. ],
                     [1. , 1. ],
                     [2. , 1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels


def loadDataSet(fileName):  # general function to parse tab-delimited floats
    numFeat = len(open(fileName).readline().split('\t'))  # get number of fields
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat - 1):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat, labelMat


def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):  # just classify the data
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray


def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m, 1)))
    minError = inf  # init error sum, to +infinity
    for i in range(n):  # loop over all dimensions
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):  # loop over all steps in current dimension
            for inequal in ['lt', 'gt']:  # go over less than and greater than
                threshVal = rangeMin + float(j) * stepSize
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T * errArr  # total error weighted by D
                # print("split: dim %d, thresh %.2f, ineq: %s, weighted error %.3f" % (i, threshVal, inequal, weightedError))
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst


def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m)  # init D to all equal
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)  # build stump
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))  # calc alpha; max(error, eps) guards against error = 0
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)  # store stump params in array
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)  # exponent for the D update
        D = multiply(D, exp(expon))  # calc new D for next iteration
        D = D / D.sum()
        # calc training error of the combined classifiers; if this is 0 quit the loop early
        aggClassEst += alpha * classEst
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print("total error:", errorRate)
        if errorRate == 0.0:
            break
    return weakClassArr, aggClassEst


def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass)  # same aggregation as at the end of adaBoostTrainDS
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])  # call stump classify
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)


def plotROC(predStrengths, classLabels):
    import matplotlib.pyplot as plt
    cur = (1.0, 1.0)  # cursor
    ySum = 0.0  # variable to calculate AUC
    numPosClas = sum(array(classLabels) == 1.0)
    yStep = 1 / float(numPosClas)
    xStep = 1 / float(len(classLabels) - numPosClas)
    sortedIndicies = predStrengths.argsort()  # get sorted index, it's reversed
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    # loop through all the values, drawing a line segment at each point
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep
        else:
            delX = xStep; delY = 0
            ySum += cur[1]
        # draw line from cur to (cur[0]-delX, cur[1]-delY)
        ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
        cur = (cur[0] - delX, cur[1] - delY)
    ax.plot([0, 1], [0, 1], 'b--')
    plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
    plt.title('ROC curve for AdaBoost horse colic detection system')
    ax.axis([0, 1, 0, 1])
    plt.show()
    print("the area under the curve is:", ySum * xStep)
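To try the listing out, the snippet below (not part of the original code) trains on the five-point toy set from loadSimpData and classifies two new points; the horse-colic file names in the commented lines are conventional companions to this code and are assumptions here, not files provided by this page.

if __name__ == '__main__':
    # toy example: train an ensemble of stumps, then classify two new points
    datMat, classLabels = loadSimpData()
    classifierArr, aggClassEst = adaBoostTrainDS(datMat, classLabels, 30)
    print(adaClassify([[5., 5.], [0., 0.]], classifierArr))  # should print 1 for (5, 5) and -1 for (0, 0)

    # horse colic ROC example (file names are assumed; point them at your local copies)
    # datArr, labelArr = loadDataSet('horseColicTraining2.txt')
    # classifierArray, aggClassEst = adaBoostTrainDS(datArr, labelArr, 10)
    # plotROC(aggClassEst.T, labelArr)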

AdaBoost is the most popular meta-algorithm and one of the most powerful tools in machine learning.

The combination can involve different algorithms, the same algorithm under different settings, or different parts of the dataset assigned to different classifiers and then integrated.

Pros: low generalization error, easy to code, works with most classifiers, no parameters to tune.

Cons: Sensitive to outliers.

Works with: numeric and nominal values.

Bagging is the technique of drawing S new datasets from the original one. Each new dataset has the same size as the original and is built by sampling from the original with replacement, so a given example may be selected more than once while other examples may not appear at all.

After the S datasets are built, the same learning algorithm is applied to each of them, yielding S classifiers. To classify new data, all S classifiers vote and the class receiving the most votes is taken as the final result.
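A minimal sketch of that procedure, assuming a trivial mean-threshold stump as the base learner (an illustrative choice, unrelated to the buildStump above), might look like this:

import numpy as np

def train_stump(X, y):
    # illustrative base learner: split feature 0 at its mean, predict the majority label on each side
    thresh = X[:, 0].mean()
    left, right = y[X[:, 0] <= thresh], y[X[:, 0] > thresh]
    left_lab = 1.0 if (left == 1.0).sum() >= (left == -1.0).sum() else -1.0
    right_lab = 1.0 if (right == 1.0).sum() >= (right == -1.0).sum() else -1.0
    return thresh, left_lab, right_lab

def predict_stump(model, x):
    thresh, left_lab, right_lab = model
    return left_lab if x[0] <= thresh else right_lab

def bagging_train(X, y, S=10, seed=0):
    # build S bootstrap samples (same size as the original, drawn with replacement)
    # and train one classifier per sample
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(S):
        idx = rng.integers(0, n, size=n)  # duplicates allowed; some rows may never be drawn
        models.append(train_stump(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    # majority vote of the S classifiers
    votes = [predict_stump(m, x) for m in models]
    return 1.0 if votes.count(1.0) >= votes.count(-1.0) else -1.0

X = np.array([[1., 2.1], [2., 1.1], [1.3, 1.], [1., 1.], [2., 1.]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
models = bagging_train(X, y, S=15)
print(bagging_predict(models, np.array([2., 1.])))  # vote of 15 stumps on one new point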

A more advanced bagging method is the random forest.

Boosting is a technique similar to bagging, but whereas bagging trains its classifiers independently, boosting trains them sequentially: each new classifier focuses on the data that the previous classifiers misclassified.

The output of boosting is a weighted sum of all classifiers' results. In bagging the weights are equal, whereas in boosting they differ: each weight reflects how well the corresponding classifier performed in the previous round.

AdaBoost is one such boosting algorithm.

The AdaBoost algorithm can be described in three steps:
(1) First, initialize the weight distribution D1 over the training data. Assuming there are N training samples, each sample is given the same weight at the start: w1 = 1/N.
(2) Then, train the weak classifiers h_t. If a training sample is classified correctly by h_t, its weight is reduced when the next training round is constructed; conversely, if it is misclassified, its weight is increased. The updated weights are used to train the next classifier, and the whole training process continues iteratively.
(3) Finally, the weak classifiers from all rounds are combined into a strong classifier. After training finishes, a weak classifier with a small classification error rate receives a large weight and plays a large role in the final decision function, while one with a large error rate receives a small weight and plays a small role.
In other words, a weak classifier with a low error rate carries more weight in the final classifier, and one with a high error rate carries less. The corresponding formulas are summarized below.
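In formula form, these are the standard AdaBoost equations, and they match the listing above: error is the weighted error rate ε_t of the t-th weak classifier, alpha is α_t, and D is the vector of sample weights.

$$
D_1(i) = \frac{1}{N}, \qquad
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad
D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}, \qquad
H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)
$$

Here h_t(x_i) ∈ {−1, +1} is the prediction of the t-th weak classifier on sample i, y_i is its true label, and Z_t is the normalizer that keeps the weights summing to one (the D / D.sum() step in adaBoostTrainDS). A misclassified sample (y_i h_t(x_i) = −1) has its weight multiplied by e^{α_t}, while a correctly classified one is multiplied by e^{−α_t}.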
