Generally, the full data set is divided into two parts: roughly 60-80% of the data is put into our training set and used to generate the model; the rest of the data is put into a test set, which is used to test the model's accuracy as soon as the model has been generated.
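The text doesn't name a particular tool, but the split itself is easy to sketch. Here is a minimal example in Python with scikit-learn (my assumption; the data set is synthetic and the 70/30 ratio is just one choice inside the 60-80% range):

```python
# Minimal sketch of the 60-80% split described above, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical data set: 1000 rows, 10 attributes.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep 70% for training (inside the 60-80% range) and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(len(X_train), "training rows,", len(X_test), "test rows")
```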
Why is this extra step so important? The problem it guards against is called overfitting: if we use all of our data for model creation, the model may fit that data perfectly, but it will only apply to that data. Remember: we want to use this model to predict future unknowns; we do not want it merely to repeat values we already know. This is why we create a test set. After creating the model, we check that its accuracy does not drop on the test set. If it does not, we can be more confident that the model will accurately predict unknown values in the future.
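This check is easy to see in code. The sketch below (again scikit-learn on synthetic data, my assumption) grows an unconstrained tree and compares its accuracy on the training data against its accuracy on the held-out test set; a large gap between the two numbers is exactly the overfitting described above:

```python
# Compare accuracy on the training data vs. the held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree can effectively memorize the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # typically near 1.0
print("test accuracy:    ", model.score(X_test, y_test))    # noticeably lower
```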
Pruning. Pruning, as the name implies, means cutting branches off the classification tree. So why would anyone want to remove information from the tree? Again, because of overfitting. As the data set grows and the number of attributes increases, the trees we create become more and more complex; in theory, a tree could have leaves = (rows × attributes). But what would that gain us? In terms of predicting future unknowns it would not help us at all, because it would fit only our existing training data. So we need a balance: we want our tree to be as simple as possible, with as few nodes and leaves as possible, while staying as accurate as possible.
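One way to sketch that balance is scikit-learn's cost-complexity pruning; this is my substitution for whatever pruning method the original text had in mind, and the alpha value is illustrative:

```python
# Grow one tree fully and one with cost-complexity pruning (ccp_alpha),
# then compare their size and test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

# The pruned tree trades a little training accuracy for far fewer leaves,
# which usually helps on unseen data -- the balance described above.
for name, tree in [("full  ", full), ("pruned", pruned)]:
    print(name, "leaves:", tree.get_n_leaves(),
          " test accuracy:", round(tree.score(X_test, y_test), 3))
```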
A false positive refers to a data instance where the model we created predicts that the value should be positive, but the actual value is negative. Likewise, a false negative refers to a data instance where the model predicts that the value should be negative, but the actual value is positive.
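Counting these two kinds of errors is usually done with a confusion matrix; here is a small sketch (scikit-learn again, with made-up labels):

```python
# Count false positives and false negatives with a confusion matrix.
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative
predicted = [1, 1, 0, 1, 0, 0, 1, 0]

# With labels=[0, 1], rows are actual classes and columns are predictions,
# so ravel() yields the four counts in this order:
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print("false positives:", fp)  # predicted positive, actually negative
print("false negatives:", fn)  # predicted negative, actually positive
```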
These errors indicate a problem with our model: it is misclassifying some of the data. Some misclassification may be acceptable, and the acceptable percentage of errors is up to the model's creator. For example, if you are testing heart monitors for a hospital, clearly you will require a very low percentage of errors; if you are merely mining some fictional data for an article on data mining, the error rate can be higher. To take this one step further, you also need to decide what percentage of false negatives versus false positives is acceptable. The example that immediately comes to mind is a spam model: a false positive (a real e-mail marked as spam) is far more damaging than a false negative (a spam message not marked as spam). In an example like this, we might decide that a false-negative-to-false-positive ratio of at least 100:1 is acceptable.
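One common way to shift that trade-off (my illustration, not the text's own method) is to raise the probability threshold for the costly class, so mail is only marked as spam when the model is very confident. All names and numbers below are illustrative:

```python
# Trade false positives for false negatives by raising the spam threshold.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Stand-in for a spam data set: class 1 = spam, class 0 = legitimate mail.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
spam_probability = model.predict_proba(X_test)[:, 1]

# A stricter threshold yields fewer false positives, more false negatives.
for threshold in (0.5, 0.95):
    predicted = (spam_probability >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted, labels=[0, 1]).ravel()
    print(f"threshold {threshold}: false positives={fp}, false negatives={fn}")
```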