Objective
Anyone who studies classical machine learning knows the importance of decision tree algorithms. Whether it is ID3, C4.5, or another variant, they all face the same problem: a full decision tree grown directly from the training samples is over-fitted, that is, it is too precise. Such a tree is not the best tree for analyzing new data, because it describes the characteristics of the training samples so exactly that it cannot generalize sensibly to new samples. The solution is to prune the decision tree, cutting off the branches that hurt predictive accuracy. There are two kinds of pruning strategies: pre-pruning and post-pruning. Pre-pruning restricts the full growth of the tree by imposing rules during construction, while post-pruning waits until the tree has fully grown and then prunes it. Because pre-pruning is used less often in practice, this series focuses on post-pruning techniques, and this article introduces pessimistic pruning.
I. Review of statistics-related knowledge
1. Confidence interval
Suppose that, for a large sample, the estimator θ' is approximately normally distributed with mean E(θ') = θ and standard error σ'. Then the (1 − α)·100% confidence interval for θ is:
θ' ± z_{α/2} · σ'
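As a quick numerical illustration (the point estimate and standard error below are hypothetical, not from this post), the interval can be computed like this:

```python
# Minimal sketch (my own illustration): a (1 - alpha) * 100% confidence interval
# theta' +/- z_{alpha/2} * sigma'.  The estimate and standard error are hypothetical.
from scipy.stats import norm

theta_hat = 0.62   # hypothetical point estimate theta'
se = 0.04          # hypothetical standard error sigma'
alpha = 0.05       # for a 95% interval

z = norm.ppf(1 - alpha / 2)   # z_{alpha/2}, about 1.96
print(theta_hat - z * se, theta_hat + z * se)
```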
2. Binomial probability distribution
The mean and variance are μ = np and σ² = npq, where p is the probability of success on each trial and q = 1 − p.
3. Normal approximation of the binomial distribution
If np >= 4 and nq >= 4, the binomial probability distribution P(Y) can be approximated by a normal distribution. For example, P(Y <= 2) corresponds to the area under the normal curve to the left of y = 2.5. Using the area to the left of y = 2 would not be appropriate, because it omits half of the rectangle corresponding to the probability of Y = 2. Since we are using a continuous distribution to approximate a discrete one, we add 0.5 to 2 before computing the probability. The value 0.5 is called the continuity correction factor of the normal approximation to the binomial distribution, so:
P(Y <= a) ≈ P(Z <= (a + 0.5 − np) / sqrt(npq))
P(Y >= a) ≈ P(Z >= (a − 0.5 − np) / sqrt(npq))
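To make the correction concrete, here is a small sketch (n = 20 and p = 0.3 are hypothetical values of my own, chosen so that np >= 4 and nq >= 4) comparing the exact binomial probability with the corrected normal approximation:

```python
# Sketch: normal approximation of a binomial probability with the 0.5 continuity correction.
from math import sqrt
from scipy.stats import binom, norm

n, p = 20, 0.3                      # hypothetical; np = 6 >= 4 and nq = 14 >= 4
q = 1 - p
mu, sigma = n * p, sqrt(n * p * q)  # mean np, standard deviation sqrt(npq)

a = 2
exact = binom.cdf(a, n, p)                 # exact P(Y <= 2)
approx = norm.cdf((a + 0.5 - mu) / sigma)  # P(Z <= (a + 0.5 - np) / sqrt(npq))
print(exact, approx)
```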
II. Pruning process
For post-pruning techniques, pruning begins only after the decision tree has been fully formed. The pruning process removes certain subtrees and replaces each of them with a leaf node, whose class label is determined by the majority class criterion: the new leaf is labeled with the class of the majority of the training samples in the pruned subtree. This label is called the majority class (a term that also appears frequently in the English literature); a small sketch of this step follows below.
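A minimal sketch of the majority class criterion (my own illustrative code, with made-up labels):

```python
# Sketch: the majority class criterion -- the leaf that replaces a pruned subtree
# takes the most common class among the training samples that reach that subtree.
from collections import Counter

def majority_class(sample_labels):
    """Return the most frequent class label among the subtree's training samples."""
    return Counter(sample_labels).most_common(1)[0][0]

print(majority_class(["yes", "no", "yes", "yes", "no"]))   # -> "yes"
```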
III. Pessimistic pruning: Pessimistic Error Pruning (PEP)
The PEP post-pruning technique was proposed by Quinlan. Unlike REP (Reduced Error Pruning), it does not require setting aside part of the samples as test data: the training data is used both to generate the decision tree and to prune it. Because tree generation and pruning both use the training set, the error estimate on that set is biased. Let us first introduce a few definitions.
T1 is the set of all internal (non-leaf) nodes of the decision tree T;
T2 is the set of all leaf nodes of the decision tree T;
T3 is the set of all nodes of T, so T3 = T1 ∪ T2;
n(t) is the number of training samples at node t;
n_i(t) is the number of training samples of class i at node t;
e(t) is the number of samples at node t that do not belong to the class assigned to node t (the misclassified samples).
When node t is pruned (collapsed to a leaf), we use
r(t) = e(t) / n(t)
as its error rate on the training set, and for the unpruned subtree T_t rooted at t we use
r(T_t) = Σ_s e(s) / Σ_s n(s),
where s ranges over the leaf nodes of T_t.
Here we treat the errors as following a binomial distribution. As explained in the "normal approximation of the binomial distribution" section above, this estimate is biased, so a continuity correction factor is used to correct it:
r'(t) = [e(t) + 1/2] / n(t)
and
r'(T_t) = [Σ_s e(s) + L/2] / Σ_s n(s),
where s ranges over the leaf nodes of T_t and L is the number of leaf nodes of T_t.
For simplicity, we work with error counts rather than error rates:
e'(t) = e(t) + 1/2,
e'(T_t) = Σ_s e(s) + L/2.
The standard deviation of e'(T_t) is then obtained. Since the errors are treated as approximately binomially distributed, applying μ = np and σ² = npq gives
SE(e'(T_t)) = sqrt( e'(T_t) · (n(t) − e'(T_t)) / n(t) ).
When node t satisfies
e'(t) ≤ e'(T_t) + SE(e'(T_t)),
the subtree T_t is cut off and replaced by a leaf.
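Putting the definitions above together, here is a minimal sketch of the PEP test (my own code; the node counts at the end are hypothetical, chosen only to illustrate the arithmetic):

```python
# Sketch of the PEP pruning test e'(t) <= e'(T_t) + SE(e'(T_t)) for one internal node t.
from math import sqrt

def should_prune(n_t, e_leaf, leaf_errors):
    """n_t: number of training samples reaching node t;
    e_leaf: errors e(t) if t is collapsed to a single majority-class leaf;
    leaf_errors: e(s) for each leaf s of the subtree T_t."""
    L = len(leaf_errors)
    e_prime_t = e_leaf + 0.5                          # e'(t) = e(t) + 1/2
    e_prime_subtree = sum(leaf_errors) + L / 2        # e'(T_t) = sum_s e(s) + L/2
    se = sqrt(e_prime_subtree * (n_t - e_prime_subtree) / n_t)  # SE(e'(T_t))
    return e_prime_t <= e_prime_subtree + se

# Hypothetical node: 16 samples reach t, pruning would misclassify 7 of them,
# and the subtree's 4 leaves make 1 + 0 + 0 + 1 = 2 errors -> False, keep the subtree.
print(should_prune(n_t=16, e_leaf=7, leaf_errors=[1, 0, 0, 1]))
```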
IV. Summary
The knowledge involved in machine learning is both deep and broad, so you must have a solid grasp of the relevant mathematics, statistics, and algorithms, and keep summarizing and organizing what you learn. This material tends to be obscure, and reading other people's blogs can mislead you when the author has misunderstood something; blogs are not authoritative and carry no guarantee of correctness. For a rigorous subject like machine learning you should consult multiple references and, above all, read the literature, especially the original papers by the algorithms' authors. If I have misunderstood anything here, please point it out; thank you in advance.
V. Recommended reading
To learn about other pruning algorithms (REP, MEP, EBP), you can refer to this article: http://52weis.com/articles.html?id=718_21