Classification methods and their application scenarios: logistic regression, support vector machines, random forests, gradient-boosted trees (GBT), and deep learning.
Factors to consider when choosing a classifier:
* Feature dimensionality
* Whether the data is linearly separable
* Whether the features are independent of each other
* Whether features are linearly dependent (collinear) and how prone the model is to overfitting the target variable
* Constraints on speed, accuracy, and memory
logistic regression
Application scenario: the features are approximately linear (i.e., the log-odds of the target is roughly a linear function of the features) and the data is approximately linearly separable; non-linear features can often be transformed into linear ones (e.g., by binning or adding interaction terms).
Advantages:
1. Robust to noise; overfitting can be controlled with L2 regularization, and L1 regularization additionally performs feature selection
2. Can be used in big-data scenarios since training is efficient and can be distributed, e.g., using ADMM
3. Outputs can be interpreted as class probabilities
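The regularization and probability points above can be sketched with scikit-learn (assumed available; the dataset and hyperparameter values are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# L1 regularization drives uninformative coefficients to exactly zero,
# which acts as built-in feature selection.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

n_selected = (clf.coef_ != 0).sum()
print("non-zero coefficients:", n_selected, "of", X.shape[1])

# The model outputs class probabilities, not just hard labels.
proba = clf.predict_proba(X[:1])
print("P(class 0), P(class 1):", proba[0])
```

Inspecting which coefficients survive the L1 penalty is a quick way to see which features the model actually uses.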
support vector machines
In practice, an SVM with a linear kernel is not very different from logistic regression.
Application scenarios that call for an SVM: the data is not linearly separable and an SVM with a non-linear kernel is required (logistic regression can also be kernelized, but in practice kernel SVMs are preferred because the SVM solution depends only on the support vectors, which keeps kernel methods tractable); another scenario is a very high-dimensional feature space — for example, SVMs work well in text classification.
Disadvantages:
SVM training is very time-consuming (kernel SVM training scales roughly quadratically to cubically in the number of samples), so it is not recommended for large training sets (roughly beyond tens of thousands of samples) or industrial-scale data.
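A minimal sketch of the non-linearly-separable case, assuming scikit-learn: on concentric circles, no straight line separates the classes, so a linear model sits near chance while an RBF-kernel SVM fits the boundary.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

linear = LogisticRegression().fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear LR accuracy:", linear.score(X, y))   # near chance (~0.5)
print("RBF SVM accuracy:", rbf_svm.score(X, y))    # near 1.0
```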
Tree ensembles
Random Forests and Gradient Boosted Trees
Advantages of Tree ensembles compared to Logistic Regression:
They do not require linear features, or even features that interact linearly. For example, LR has difficulty with categorical features (they must be encoded first), whereas tree ensembles, being collections of decision trees, handle such cases easily.
Because of how the algorithms are constructed (bagging or boosting), they easily handle high-dimensional data and scenarios with a large amount of training data.
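The non-linear-interaction point can be sketched with an XOR-style target, which a linear model cannot capture but a tree ensemble learns easily (scikit-learn assumed; the dataset is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
# Label is the XOR of the two feature signs: a pure interaction effect,
# with no linear relationship to either feature alone.
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

lr_acc = LogisticRegression().fit(X, y).score(X, y)
rf_acc = RandomForestClassifier(random_state=0).fit(X, y).score(X, y)
print("LR accuracy:", lr_acc)   # near chance (~0.5)
print("RF accuracy:", rf_acc)   # near 1.0
```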
RF (Random Forests) vs GBDT (Gradient Boosted Decision Trees)
GBDTs will usually perform better, but they are harder to get right: a GBDT has many hyperparameters to tune (number of trees, learning rate, tree depth, subsampling, ...) and overfits more easily, whereas RFs work almost "out of the box" — that is, the default settings usually give reasonable results with little or no tuning — which is one reason they are so popular.
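The "out of the box" contrast can be sketched as follows (scikit-learn assumed; the GBDT hyperparameter values are illustrative, meant only to show which knobs typically need tuning):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# RF: defaults usually give reasonable results with no tuning.
rf = RandomForestClassifier(random_state=0)

# GBDT: these hyperparameters interact and typically need tuning
# (e.g., a lower learning rate usually requires more estimators).
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                  max_depth=3, subsample=0.8, random_state=0)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
gbdt_score = cross_val_score(gbdt, X, y, cv=5).mean()
print("RF CV accuracy:  ", rf_score)
print("GBDT CV accuracy:", gbdt_score)
```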
Deep Learning
To summarize: start simple to set a baseline, and only make the model more complicated if you need to:
1. Start with a simple Logistic Regression and set a baseline
2. Random Forests (easy to tune)
3. GBDT
4. fancier model
5. deep learning
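The escalation above can be sketched as a loop over increasingly complex models, stopping once the gain over the previous step is negligible (scikit-learn assumed; the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Ordered from simple baseline to fancier model.
models = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
]

results = {}
for name, model in models:
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

If the simple baseline is already close to the best score, the extra complexity (and tuning effort) of the fancier models may not be worth it.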