Machine learning done wrong

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value store, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it's reasonable to try as many algorithms as possible and to pick the best one, since the cost of experimentation is low. But as we hit "Big Data", it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it's not obvious how to navigate and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one which best suits the data. In this post, I would like to share some common mistakes (the don't-s). I'll save some of the best practices (the do-s) for a future post.

1. Take the default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, an off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss functions of binary classifiers weigh false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).
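
The post doesn't prescribe an implementation, but here is a minimal sketch of one way to encode such a cost-sensitive objective, assuming scikit-learn and a synthetic data set (the features, amounts, and weighting scheme are all illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # transaction features (synthetic)
y = (rng.random(1000) < 0.02).astype(int)      # ~2% fraud: highly imbalanced labels
amount = rng.uniform(10.0, 500.0, size=1000)   # dollar amount per transaction

# Make missing an expensive fraud cost more than missing a cheap one:
# weight each fraudulent example by its dollar amount.
weights = np.where(y == 1, amount, 1.0)

clf = LogisticRegression(class_weight="balanced")  # also counters the label imbalance
clf.fit(X, y, sample_weight=weights)
```

Per-example weights are only one option; up/down sampling the training set, as mentioned above, biases the loss in a similar way.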

2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump to logistic regression because it's simple. But many forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models like SVMs with kernels or tree-based classifiers that bake in higher-order interaction features, as sketched below.
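
As an illustration (not code from the post, and the column names are hypothetical), the interaction must be hand-built for a linear model, while a tree ensemble can learn it from the raw columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "billing_eq_shipping": rng.integers(0, 2, 1000),
    "amount": rng.uniform(1.0, 500.0, 1000),
})
# Toy labels driven by the interaction itself.
y = ((df["billing_eq_shipping"] == 1) & (df["amount"] < 50)).astype(int)

# A linear model needs the interaction spelled out as its own column...
df["match_and_small_amount"] = (
    (df["billing_eq_shipping"] == 1) & (df["amount"] < 50)
).astype(int)
linear = LogisticRegression().fit(df, y)

# ...whereas a tree-based model can discover it from the raw inputs.
trees = GradientBoostingClassifier().fit(
    df[["billing_eq_shipping", "amount"]], y
)
```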

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that's not generalizable, it's a good idea to filter them out before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, AdaBoost might treat outliers as 'hard' cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair amount of outliers, it's important to either use a modeling algorithm robust against outliers or filter the outliers out.
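
As a sketch of the filtering route (the interquartile-range rule below is a common convention, not something the post prescribes), drop points far outside the quartiles before modeling:

```python
import numpy as np

rng = np.random.default_rng(0)
revenue = rng.normal(loc=100.0, scale=10.0, size=365)  # daily revenue (synthetic)
revenue[[50, 200]] = [900.0, -500.0]                   # injected measurement errors

# IQR rule: drop points more than 1.5 IQRs outside the quartiles.
q1, q3 = np.percentile(revenue, [25, 75])
iqr = q3 - q1
mask = (revenue >= q1 - 1.5 * iqr) & (revenue <= q3 + 1.5 * iqr)
clean = revenue[mask]   # injected errors removed; a few borderline points may go too
```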

4. Use high variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use kernels by default when training an SVM model. However, when the data has n<<p (number of samples << number of features), which is common in industries like medical data, the richer feature space implies a much higher risk of overfitting the data. In fact, high variance models should be avoided entirely when n<<p.
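
A quick way to see the risk is to cross-validate a linear against an RBF kernel SVM on synthetic n<<p data (a sketch only; exact scores will vary):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 40, 2000                          # far fewer samples than features
X = rng.normal(size=(n, p))
y = (X[:, 0] > 0).astype(int)            # the signal lives in a single feature

for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, round(scores.mean(), 3))  # the richer RBF space tends to
                                            # score near chance in this regime
```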

5. L1/L2 regularization without standardization

Applying L1 or L2 to penalize large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying those regularizations.

Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than it would be if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount will be penalized more if the unit is dollars. Hence, the regularization is biased and tends to penalize features on smaller scales. To mitigate the problem, standardize all the features and put them on an equal footing as a preprocessing step.
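
As a minimal sketch with scikit-learn (the amounts and penalty strength are illustrative), re-expressing the same feature on a smaller numeric scale inflates its coefficient and thus its penalty, while standardizing first removes the discrepancy:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
dollars = rng.uniform(1.0, 500.0, size=(200, 1))
thousands = dollars / 1000.0            # same information, smaller numbers
y = 0.5 * dollars[:, 0] + rng.normal(size=200)

# The same penalty barely touches the dollar coefficient (~0.5) but
# crushes the thousands-of-dollars coefficient far below its true ~500.
print(Ridge(alpha=100.0).fit(dollars, y).coef_)
print(Ridge(alpha=100.0).fit(thousands, y).coef_)

# Standardizing first puts both encodings on an equal footing.
for X in (dollars, thousands):
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=100.0)).fit(X, y)
    print(pipe.named_steps["ridge"].coef_)   # identical either way
```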

6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables x1 and x2, and suppose the ground truth model is y = x1 + x2. Ideally, if the data is observed with a small amount of noise, the linear regression solution would recover the ground truth. However, if x1 and x2 are collinear, then as far as most optimization algorithms are concerned, y = 2*x1, y = 3*x1 - x2 and y = 100*x1 - 99*x2 are all equally good. The problem might not be detrimental, as it doesn't bias the estimation. However, it does make the problem ill-conditioned and makes the coefficient weights uninterpretable.
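
A small numpy sketch of this instability (synthetic data; exact numbers will differ): with x2 nearly duplicating x1, refitting on a fresh noise draw swings the individual coefficients wildly even though their sum, and the predictions, stay stable:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=1e-3, size=500)   # x2 is nearly a copy of x1
X = np.column_stack([x1, x2])

# Two noise draws on the same ground truth y = x1 + x2 ...
for seed in (1, 2):
    noise = np.random.default_rng(seed).normal(scale=0.1, size=500)
    coef, *_ = np.linalg.lstsq(X, x1 + x2 + noise, rcond=None)
    print(coef)   # each pair can land far from (1, 1), yet fits equally well

print(np.linalg.cond(X))   # the large condition number flags the problem
```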

7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of the coefficient, the more important the corresponding feature is. This is rarely true, as (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely the features are to be multi-collinear, and the less reliable it is to interpret feature importance by coefficients.
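
Point (a) is easy to demonstrate with a sketch (synthetic data, scikit-learn assumed): re-expressing one feature in different units rescales its coefficient without changing the model at all:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

print(LinearRegression().fit(X, y).coef_)    # roughly [1, 1]

X2 = X.copy()
X2[:, 0] /= 1000.0                           # re-express feature 0 in new units
print(LinearRegression().fit(X2, y).coef_)   # roughly [1000, 1]: same model,
                                             # wildly different |coefficient|
```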

So there you go: 7 common mistakes when doing ML in practice. This list isn't meant to be exhaustive but merely to provoke the reader to consider modeling assumptions applicable to the data at hand. To achieve the best model performance, it's important to pick the modeling algorithm that makes the most fitting assumptions, not just the one you're most familiar with.

If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Ian Wong (@ihat) for reading a draft of this.

Cheng-tao Chu

Director of Analytics at Codecademy. Specialties: data engineering and machine learning. Formerly: Google, LinkedIn and Square.
