In machine learning, a classifier determines the category of a new observation based on training data labeled with known categories. Learning methods can be divided into unsupervised and supervised learning. In unsupervised learning, the samples given to the learner carry no category labels, and the goal is mainly to find hidden structure in the unlabeled data. Supervised learning, by contrast, infers a classification function from labeled training data; the learned function can then map new samples to the corresponding labels. In supervised learning, each training sample consists of the sample's features and the corresponding label. The process of supervised learning includes: determining the type of training sample, collecting the training sample set, determining the input feature representation for the learning function, determining the structure of the learning function and the corresponding learning algorithm, completing the design of the whole training module, and evaluating the accuracy of the classifier. The purpose of this section is to explain how to select a classifier; an appropriate classifier can be chosen according to the following four considerations.
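The supervised-learning loop described above can be sketched in a few lines. This is an illustrative toy, not from the text: the data are made up, and a 1-nearest-neighbor rule stands in for the learned classification function.

```python
# Minimal sketch of supervised learning: labeled training samples, a learned
# function (here: 1-nearest-neighbor on a 1-D feature), and accuracy evaluation.
# Each training sample pairs a feature value with its category label.
train = [(1.0, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]

def predict(x):
    # the "learned function": return the label of the closest training sample
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]

# evaluate the classifier's accuracy on held-out labeled samples
test = [(1.5, "A"), (2.5, "A"), (7.5, "B"), (8.5, "B")]
accuracy = sum(predict(x) == label for x, label in test) / len(test)
```

The structure mirrors the steps listed above: the sample type and feature representation are fixed first, then the learning function and algorithm, and finally the accuracy is evaluated on labeled data the learner has not seen.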
1. The trade-off between generalization ability and fitting ability
Fitting ability is measured by the classifier's performance on the training samples: if accuracy on the training set is high, the classifier fits the training data well. But a classifier that fits the training data too closely has high variance, so it will not achieve good results on the test data. If a classifier performs well on the training data but poorly on the test data, it has overfitted the training data. Conversely, if a classifier achieves good results on the test data, its generalization ability is strong. Generalization and fitting are in tension: a classifier with strong generalization ability generally has weaker fitting ability, and vice versa. A classifier therefore needs to strike a balance between generalization ability and fitting ability.
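The trade-off can be seen numerically with a small made-up regression example (illustrative only): a degree-9 polynomial fits 10 noisy training points almost exactly, yet its error on held-out points is far worse than its near-zero training error, while a simple linear fit generalizes without fitting the training set as tightly.

```python
# Overfitting demo: compare training error and test error of a simple model
# (degree-1 polynomial) and a complex model (degree-9 polynomial) on noisy data.
import warnings
import numpy as np

warnings.simplefilter("ignore")  # silence polyfit conditioning warnings
rng = np.random.default_rng(0)

def noisy_line(x):
    # true relationship: y = 2x + 1, plus Gaussian noise
    return 2 * x + 1 + rng.normal(0, 0.3, size=x.shape)

x_train = np.linspace(0, 1, 10)
y_train = noisy_line(x_train)
x_test = np.linspace(0.05, 0.95, 10)  # held-out points between training points
y_test = noisy_line(x_test)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, 1)    # weaker fit, stronger generalization
complex_ = np.polyfit(x_train, y_train, 9)  # fits the training data (almost) exactly

train_err_simple = mse(simple, x_train, y_train)
train_err_complex = mse(complex_, x_train, y_train)
test_err_complex = mse(complex_, x_test, y_test)
```

The complex model wins on the training set but its test error is much larger than its training error, which is exactly the overfitting symptom described above.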
2. Complexity of the classification function and size of the training data
The size of the training data is also important for classifier selection. For a simple classification problem, a classifier with limited fitting capacity can be trained well on a small amount of training data. Conversely, a complex classification problem requires a large amount of training data and a learning algorithm with strong fitting capacity. A good classifier should be able to automatically adjust the balance between fitting and generalization according to the complexity of the problem and the size of the training data.
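One common way to "adjust the balance automatically" is to choose the model complexity on a held-out validation set rather than on the training set. The sketch below (hypothetical data, polynomial degree standing in for classifier complexity) picks the degree with the lowest validation error.

```python
# Model selection by validation error: among polynomial degrees 0..8,
# pick the one that minimizes error on a held-out validation split.
import warnings
import numpy as np

warnings.simplefilter("ignore")  # silence polyfit conditioning warnings
rng = np.random.default_rng(1)

x = np.linspace(-1, 1, 40)
y = 1.5 * x**2 - x + rng.normal(0, 0.2, size=x.shape)  # true model is quadratic

# split the labeled data into training and validation halves
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def val_error(degree):
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

errors = {d: val_error(d) for d in range(9)}
best_degree = min(errors, key=errors.get)
```

Degrees 0 and 1 underfit the quadratic signal and have high validation error, so the selected degree ends up at least 2; larger datasets can support larger degrees before the validation error starts to rise.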
3. Dimension of the input feature space
If the input feature vectors have very high dimension, the classification problem becomes harder, even if the final classification function depends on only a few of the features. This is because a high feature dimension can mislead the learning algorithm and drive the classifier's variance too high; a classifier with excessive variance reacts too strongly to small changes in the input, and its performance degrades. Therefore, a classifier with high-dimensional feature-vector input needs its parameters tuned to restrain its fitting ability and favor generalization. In addition, classifier performance can often be improved by removing irrelevant features from the input data or by reducing the feature dimension.
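A minimal sketch of removing irrelevant features, on hypothetical data where only 2 of 20 features actually influence the label: rank each feature by its absolute correlation with the label and keep the top-ranked ones. (Correlation ranking is one simple filter method; it is used here only for illustration.)

```python
# Feature selection by correlation ranking: score each feature by |corr| with
# the label, then keep the highest-scoring features.
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
# the label depends only on features 0 and 1; the other 18 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=n_samples)

def abs_corr(feature, target):
    return abs(float(np.corrcoef(feature, target)[0, 1]))

scores = [abs_corr(X[:, j], y) for j in range(n_features)]
ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
top_two = set(ranked[:2])  # the informative features rise to the top
```

Training on the selected features instead of all 20 reduces the input dimension and, with it, the variance that high-dimensional noise features would otherwise introduce.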
4. Uniformity of the input feature vectors and their relationships to each other
If the feature vectors contain multiple types of data (for example, both discrete and continuous values), many classifiers, such as SVM, linear regression, and logistic regression, do not apply directly. These classifiers require the input features to be numeric and normalized to similar ranges. Classifiers based on distance functions, such as the k-nearest-neighbor algorithm and the SVM with a Gaussian kernel, are especially sensitive to the uniformity of the data; decision trees, in contrast, can handle such non-uniform data. If the input features are independent of one another, that is, each feature's contribution to the classifier's output does not depend on the other features, then classifiers based on linear functions and distance functions, such as linear regression, SVM, and naive Bayes, are good choices. Conversely, if there are complex correlations among the features, decision trees and neural networks are better suited to such problems.
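The sensitivity of distance-based classifiers to feature scale can be shown with made-up numbers: before normalization, a feature measured in the thousands dominates the Euclidean distance and selects one nearest neighbor; after min-max scaling both features contribute, and a different neighbor wins.

```python
# Why distance-based classifiers need normalized features: the nearest
# neighbor of a query changes once both features are min-max scaled to [0, 1].
import math

# three training points, each (feature1 in ~[0, 10], feature2 in ~[1000, 1500])
points = [(0.5, 1000.0), (9.5, 1050.0), (0.5, 1500.0)]
query = (9.0, 1500.0)

def nearest(pts, q):
    # index of the point with the smallest Euclidean distance to q
    dists = [math.dist(p, q) for p in pts]
    return dists.index(min(dists))

def min_max_scale(pts, q):
    # scale every feature to [0, 1] using the training points' min and max
    cols = list(zip(*pts))
    lo = [min(c) for c in cols]
    span = [max(c) - min(c) for c in cols]
    scale = lambda p: tuple((v - l) / s for v, l, s in zip(p, lo, span))
    return [scale(p) for p in pts], scale(q)

raw_nn = nearest(points, query)          # feature2's large scale dominates
scaled_pts, scaled_q = min_max_scale(points, query)
scaled_nn = nearest(scaled_pts, scaled_q)  # both features now contribute
```

Before scaling, the query's nearest neighbor is the point that happens to match it on the large-scale second feature, even though it is far away on the first; after scaling, the neighbor that is reasonably close on both features wins. This is why k-NN and Gaussian-kernel SVMs expect normalized inputs, while decision trees, which split one feature at a time, do not care about scale.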