Ensemble methods improve classification accuracy by combining the results of multiple classifiers and letting the combination determine the final class. There are three main combination methods: bagging, boosting, and random forests.
Steps shared by bagging and boosting:
1. Generate several training sets from the original data set.
2. Train a classifier on each training set.
3. Let each classifier make a prediction, then combine the predictions by simple majority vote (bagging) or weighted vote (boosting) to determine the final result.
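The three steps above can be sketched in a few lines of R. This is a minimal illustration, not a production implementation: the function name `bagging_predict` and the choice of `rpart` trees on the iris data are my assumptions for the example.

```r
# Minimal bagging sketch: k trees, each trained on a bootstrap sample,
# combined by simple majority vote. `bagging_predict` is a made-up helper
# name for this example; rpart is used as the base classifier.
library(rpart)

bagging_predict <- function(data, newdata, k = 25) {
  # Steps 1-2: draw k bootstrap training sets and fit a tree on each
  models <- lapply(seq_len(k), function(i) {
    idx <- sample(nrow(data), replace = TRUE)  # bootstrap sample of the rows
    rpart(Species ~ ., data = data[idx, ], method = "class")
  })
  # Step 3: each tree votes; a simple majority decides the final class
  votes <- sapply(models, function(m) {
    as.character(predict(m, newdata, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))
}

pred <- bagging_predict(iris, iris)
mean(pred == iris$Species)  # training accuracy of the bagged ensemble
```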
As shown in the figure, from data set D we draw subsets D1~Dk and train k classifiers M1~Mk on them; each classifier then produces a prediction on the test set. Finally, the k results are combined on the "minority obeys majority" principle. For example, with k = 99 classifiers, if 55 results are 1 and 44 results are 0, the final result is determined to be 1.
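The majority-vote example above is a one-liner in R:

```r
# 99 classifier outputs: 55 vote "1", 44 vote "0"
votes <- c(rep(1, 55), rep(0, 44))
# table() counts the votes; which.max() picks the class with the most
final <- as.numeric(names(which.max(table(votes))))
final  # 1
```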
Boosting can be regarded as an improvement of bagging, and can be understood as weighted voting. The adaptive boosting (AdaBoost) algorithm is introduced here specifically.
The algorithm is basically the same as bagging; what is new is the concept of weights. In step (1), initialization, each weight is set to 1/d, i.e., all d tuples have the same weight. In steps (9)~(11), the weights are updated after each round: the weight of every correctly classified tuple is multiplied by a number less than 1, so correctly classified tuples become less likely to be selected into the next training set Di. The classifier therefore focuses on "hard to classify" data. This rests on the belief that some classifiers may perform well on a particular kind of data.
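The weight update in steps (9)~(11) can be sketched as follows. This is an illustration assuming the AdaBoost update from Han's book, where the weights of correctly classified tuples are multiplied by err/(1 - err) and then all weights are renormalized; the helper name `update_weights` is my own.

```r
# AdaBoost tuple-weight update (steps (9)-(11)): `w` is the vector of tuple
# weights, `correct` marks the tuples classifier Mi got right, and `err` is
# Mi's weighted error rate. `update_weights` is an illustrative helper name.
update_weights <- function(w, correct, err) {
  w[correct] <- w[correct] * (err / (1 - err))  # shrink "easy" tuples' weights
  w / sum(w)                                    # renormalize so weights sum to 1
}

d <- 5
w <- rep(1 / d, d)  # step (1): every tuple starts with weight 1/d
w <- update_weights(w, correct = c(TRUE, TRUE, TRUE, TRUE, FALSE), err = 0.2)
w  # the one misclassified tuple now carries more weight than the others
```

After the update, the misclassified tuple's weight (0.5) dominates the correctly classified ones (0.125 each), which is exactly why the next training set Di favors "hard" tuples.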
Note on the tuple concept: a tuple is the smallest data unit. For example, a person is a tuple, with height, weight, and other attributes.
After training, the result is a combination (ensemble) classifier. Note that a second kind of weight appears here: the voting weight of each classifier, which is determined by that classifier's accuracy (the lower the error rate, the higher the weight).
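In AdaBoost this voting weight is computed as log((1 - err)/err), so an accurate classifier gets a large say in the final vote. A small sketch (the function name `vote_weight` is my own):

```r
# AdaBoost voting weight of classifier Mi: log((1 - err) / err).
# The lower the error rate, the higher the weight.
vote_weight <- function(err) log((1 - err) / err)

vote_weight(0.1)  # accurate classifier: large positive weight
vote_weight(0.4)  # weak classifier: weight close to zero
```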
Next we introduce a decision-tree ensemble algorithm: the random forest.
The random forest is actually very intuitive: using the bagging method described above, a decision tree is built for each subset Di with the CART algorithm (only the Gini index needs to be computed), without pruning.
All the trees in the forest then vote.
An example of a random forest in R:
If the randomForest package is not installed, first run install.packages("randomForest").

library(randomForest)
model.forest <- randomForest(Species ~ ., data = iris)
pre.forest <- predict(model.forest, iris)
table(pre.forest, iris$Species)

The accuracy (on the training data) reaches 100%.
Compare with a single decision tree:

library(rpart)
model.tree <- rpart(Species ~ ., data = iris, method = "class")
pre.tree <- predict(model.tree, iris, type = "class")
table(pre.tree, iris$Species)

Here we find that some of the data are misclassified.
PS: The ensemble-classifier algorithms above are excerpted from Jiawei Han, "Data Mining: Concepts and Techniques".