In sklearn, what kinds of data do the various classifiers and regressors apply to? (Sklearn Regression)
Author: anonymous user
Link: https://www.zhihu.com/question/52992079/answer/156294774
Source: zhihu
(Sklearn official guide: Choosing the right estimator)
0) Select an appropriate machine learning algorithm
"All models are wrong, but some models are useful." - George Box (Box & Draper, 1987)
According to the no free lunch theorem, there is no single model/algorithm in machine learning that is best in every respect, because every model makes some prior statistical assumptions about the data distribution. Averaged over all possible data distributions, every model performs equally well (or equally poorly). Therefore, we need to find the machine learning algorithm that works best for the specific problem at hand.
1) Exploratory Data Analysis
Before choosing a specific algorithm, it is best to have some understanding of how each feature in the data is distributed and how it was generated (a small code sketch follows this list):
- Is the feature continuous (real-valued) or discrete (categorical)?
- If a feature is continuous, what does its histogram look like? What are its mean and variance?
- If a feature is discrete, is there an ordering among its values? For example, Douban's one-to-five-star ratings are discrete, but they have a clear high-to-low order; a feature such as "address", on the other hand, is unlikely to have any obvious order.
- How was the feature data collected?
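A minimal sketch of this kind of check, assuming the data is already in a pandas DataFrame (the columns here are made up purely for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for a real dataset; column names are hypothetical.
df = pd.DataFrame({"age":    [23, 35, 31, 52, 46, 23, 38],
                   "rating": [5, 3, 4, 2, 5, 1, 4],                  # discrete but ordered
                   "city":   ["A", "B", "A", "C", "B", "A", "C"]})   # discrete, no order

print(df.dtypes)                    # which features are numeric vs. object/categorical
print(df.describe())                # mean, std, quartiles of the numeric columns
print(df["city"].value_counts())    # frequency of each category

df.hist(figsize=(8, 4), bins=10)    # histograms of the numeric features
plt.tight_layout()
plt.show()
```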
2) Feature Engineering
Feature engineering (creating new, valuable features from the existing ones) determines the upper limit of what machine learning can achieve; the various algorithms merely approach that limit. Different machine learning algorithms generally call for different feature engineering, and in practice the feature-engineering and parameter-tuning steps often alternate back and forth.
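As a tiny, made-up illustration of "creating new features from existing ones" (the columns and the derived BMI feature are hypothetical, not from the original answer):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"height": [1.60, 1.72, 1.85],
                   "weight": [55.0, 70.0, 82.0]})

# Hand-crafted feature: BMI = weight / height^2
df["bmi"] = df["weight"] / df["height"] ** 2

# Automatically generated interaction / polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["height", "weight"]])
print(poly.get_feature_names_out())   # (get_feature_names in older sklearn versions)
```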
3) From simple to complex: choosing a specific algorithm
Sklearn includes many machine learning algorithms. To keep things simple, we only discuss a few common categories of classifiers and regressors here. As for the principles behind the algorithms, the sklearn documentation usually lists references for each algorithm, and machine learning textbooks cover them as well.
3.1) Generalized Linear Models
When building a first model, I usually choose a linear model: high bias but low variance. Linear models have several advantages: low computational cost, fast training, small memory footprint, and they are not prone to overfitting.
Commonly used linear regressors include Ridge (linear regression with L2 regularization) and Lasso (linear regression with L1 regularization, which performs built-in feature selection and yields sparse coefficients). If there is no special requirement on the hyperparameters, you can use the RidgeCV and LassoCV provided by sklearn, which determine the hyperparameter values automatically through efficient cross-validation.
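A short sketch of the CV variants on toy data (make_regression is just a stand-in for a real dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)   # picks alpha by cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("Ridge alpha:", ridge.alpha_)
print("Lasso alpha:", lasso.alpha_)
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))   # sparse solution
```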
If you want to predict more than one target y (m samples x n targets) from the same dataset X (m samples x n features), you can use a multi-task model.
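For instance, a minimal multi-task sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import MultiTaskLassoCV

# One X of shape (m samples, n features), Y of shape (m samples, n targets)
X, Y = make_regression(n_samples=100, n_features=10, n_targets=3,
                       noise=5.0, random_state=0)

model = MultiTaskLassoCV(cv=5, random_state=0).fit(X, Y)
print(model.predict(X[:2]).shape)   # (2, 3): one column per target
```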
Among the linear classifiers, LogisticRegression and the corresponding LogisticRegressionCV are recommended.
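For example (toy data; LogisticRegressionCV chooses the regularization strength C by built-in cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```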
SGDClassifier and SGDRegressor are suitable for large datasets. However, if the dataset is really too large, it is often better to sample from it and analyze and model the sample like a small dataset; there is no need to run the algorithm on the entire dataset from the very beginning.
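One reason the SGD models suit large data is partial_fit, which lets you feed the data in mini-batches instead of loading it all at once. A rough sketch with randomly generated "chunks":

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss", random_state=0)   # loss="log" in older sklearn versions
classes = np.array([0, 1])
rng = np.random.RandomState(0)

for _ in range(10):                          # pretend each loop reads one chunk from disk
    X_batch = rng.randn(1000, 20)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.randn(200, 20)
y_test = (X_test[:, 0] > 0).astype(int)
print("held-out accuracy:", clf.score(X_test, y_test))
```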
3.2) Ensemble Methods
Ensembles can greatly improve the performance of many algorithms, especially decision trees; single decision trees are rarely used in practice. Bagging (e.g., RandomForest) trains a group of high-variance learners on different subsets of the data to reduce the overall variance; boosting builds high-bias learners sequentially to reduce the overall bias.
The most commonly used ensemble algorithms are RandomForest and GradientBoosting. Besides sklearn, there are also excellent gradient boosting libraries: XGBoost and LightGBM.
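A quick side-by-side of the two sklearn ensembles on synthetic data (XGBoost and LightGBM have their own packages and are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```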
BaggingClassifier and VotingClassifier can be used as the second-layer meta classifier/regressor, with a first-layer algorithm (such as xgboost) as the base estimator, to further build bagging or stacking ensembles.
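A rough sketch of this layering using only sklearn estimators (if xgboost/lightgbm is installed, their classifiers can be dropped in as base estimators in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging on top of a boosted model (keyword is base_estimator in sklearn < 1.2)
bagged_gbdt = BaggingClassifier(estimator=GradientBoostingClassifier(random_state=0),
                                n_estimators=10, random_state=0)

# Soft-voting ensemble over heterogeneous base models
voter = VotingClassifier(estimators=[("rf", RandomForestClassifier(random_state=0)),
                                     ("gbdt", GradientBoostingClassifier(random_state=0)),
                                     ("lr", LogisticRegression(max_iter=1000))],
                         voting="soft")

for model in (bagged_gbdt, voter):
    print(type(model).__name__, round(cross_val_score(model, X, y, cv=3).mean(), 3))
```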
I personally prefer to use this type of model.
3.3) Support Vector Machine (SVM)
For background on SVM, see Professor Andrew Ng's CS229 on Coursera (or, if you are up to it, the original CS229 lectures on YouTube or NetEase Open Course). The sklearn svm API documentation is very complete, and using it off the shelf is not particularly difficult. However, in most data mining competitions (such as Kaggle), SVM is often inferior to xgboost.
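A small SVM sketch: SVMs are sensitive to feature scale, so a scaler usually goes in front of the classifier, and C/gamma are tuned by grid search (all values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}, cv=5)
grid.fit(X_tr, y_tr)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_te, y_te))
```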
3.4) Neural Networks
Compared with the industry's leading neural network libraries (such as TensorFlow and Theano), sklearn's neural network module is relatively simple. Personally, if I want to use a neural network for classification/regression, I usually reach for keras or pytorch.
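For completeness, the sklearn side of this is MLPClassifier, a plain fully connected network; a minimal sketch (for anything deeper or more customized, keras/pytorch are the better fit, as the author says):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                                  max_iter=500, random_state=0))
print("training accuracy:", mlp.fit(X, y).score(X, y))
```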
Author: maid
Link: https://www.zhihu.com/question/52992079/answer/132946166
Source: zhihu
Update... The question asks how to use sklearn, but my answer does not touch on sklearn specifics. Which language or package you use for machine learning is not the key point.
I use sklearn most of the time as well, and my approach is the same: I recommend writing a function yourself that calls different classification methods with their default parameters under cross-validation, then prints the scores or draws the ROC curves. See more datasets, accumulate more experience, get better at probability and statistics, and your level rises... I am still grinding through that process myself.
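One way such a helper function might look (the classifier list and the synthetic data are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, cv=5):
    """Cross-validate several classifiers with default parameters and print mean scores."""
    models = [LogisticRegression(max_iter=1000), SVC(),
              DecisionTreeClassifier(random_state=0), RandomForestClassifier(random_state=0)]
    for model in models:
        scores = cross_val_score(model, X, y, cv=cv)
        print(f"{type(model).__name__:<24} {scores.mean():.3f} +/- {scores.std():.3f}")

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
compare_classifiers(X, y)
```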
----------------
The original answer is as follows:
Thank you. I have had two competitions recently and have been very busy, so I will just give a general introduction and point the asker toward directions for further research.
Which classification/regression tool to use depends on the characteristics of the data.
For example, when predicting whether a customer will buy a product, the target is whether the customer buys it, i.e., true or false (which can be encoded as 1 or 0); this is binary classification.
The training-set information available for analysis, besides the target, might include the customer's age, gender, whether they are accompanied, clothing style, and the product's price, type, and brand.
We observe that age is an integer, and the gaps between those integers are meaningful. The customer's clothing style, on the other hand, may also be encoded as integers, say 1, 2, 3, 4..., but different integers simply denote different styles, so the gaps between them have no practical interpretation: the numbers are just an arbitrary ordering of labels such as "formal suit" and "casual" within "clothing style". Think about it: if two samples agree on all other features and differ only slightly in age, we can reasonably guess that the two samples behave the same (have the same target).
However, if two samples differ slightly in the clothing-style code, say fashionable young people like high-waisted trousers, high-waisted trousers are coded 30 and suspender trousers are coded 31, do you still think the two targets have to be the same?
In the Euclidean space spanned by several such features, some algorithms tend to cut the space into a number of regions, each of which is assumed to behave similarly; but code 30 and code 31, i.e., high-waisted trousers and suspender trousers, are obviously not similar.
For a method like SVM, which is based on splitting the space, when the "continuity" of a feature such as clothing style is this poor, the error can be very large, because samples that are close in feature space can behave very differently.
Tree-based methods, by contrast, do not care about the continuity of the mapping from the feature space to the target set {0, 1} (call it F for convenience). What they consider is how to make a good split on each individual feature, which largely avoids the error caused by splitting the space (strictly speaking, a tree also partitions the space, but that is not its theoretical basis).
At the same time, tree models also work well when the features are very different in nature from one another.
If a linear model is needed for classification, the non-continuous features of F can be dummy-encoded. Dummy encoding can make some non-linear features more linear, at the cost of sparsity; this is common in natural language processing.
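A dummy-encoding sketch for the "clothing style" example above (the style names and codes are made up; pd.get_dummies does the same job as OneHotEncoder):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"style": ["formal", "casual", "sporty", "casual"],
                   "age":   [35, 22, 19, 27]})

enc = OneHotEncoder(sparse_output=True)          # keyword is sparse in sklearn < 1.2
style_onehot = enc.fit_transform(df[["style"]])
print(enc.get_feature_names_out())               # ['style_casual' 'style_formal' 'style_sporty']
print(style_onehot.toarray())                    # one 0/1 column per style, no implied order

# The pandas shortcut:
print(pd.get_dummies(df, columns=["style"]))
```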
The above is only part of the process of analyzing features and choosing a classifier; hopefully the asker finds it enlightening.
Finally, some rules of thumb from experience (a small example for point 4 follows the list):
1. Sparse, large-scale data: SGD and logistic regression.
2. Normalized real-valued data: SVM.
3. Poorly preprocessed data: ensembles of weak classifiers.
4. Problems with an obvious probability-distribution structure: Bayesian methods (rarely used, but quite effective in text-processing problems with a large corpus).
5. Neural networks: relu, relu, relu.
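As a tiny illustration of point 4, naive Bayes on a bag-of-words text representation (the toy texts and labels are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["cheap watches buy now", "meeting moved to friday",
         "win a free prize today", "lunch with the project team"]
labels = [1, 0, 1, 0]                             # 1 = spam, 0 = ham (toy labels)

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free watches prize"]))        # most likely [1]
```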