The original title is "Top Data Mining Mistakes", by John F. Elder IV, Ph.D.
Compiled by: Idmer (a data miner)
http://www.salford-systems.com/doc/elder.pdf
According to Dr. Elder's summary, the most common mistakes are the following (numbered 0 through 10):
0. Lack of data (Lack Data)
1. Focusing on training performance (Focus on Training)
2. Relying on a single technique (Rely on One Technique)
3. Asking the wrong question (Ask the Wrong Question)
4. Listening only to the data (Listen (Only) to the Data)
5. Using information from the future (Accept Leaks from the Future)
6. Discarding cases that should not be ignored (Discount Pesky Cases)
7. Extrapolating credulously (Extrapolate)
8. Trying to answer every question (Answer Every Inquiry)
9. Sampling carelessly (Sample Casually)
10. Believing the best model (Believe the Best Model)
0. Lack of data (Lack Data)
For classification and prediction problems, accurately labeled cases are often in short supply.
For example:
- Fraud detection: among millions of transactions there may be only a handful of fraudulent ones, and many of those are not labeled correctly; a great deal of manual work is needed to fix the labels before modeling can begin.
- Credit scoring: potentially high-risk customers must be tracked over a long period (for example, two years) before enough scored samples accumulate.
1. Focusing on training performance (Focus on Training)
Idmer: Athletic training increasingly emphasizes real-match conditions for a reason: athletes who only ever train in a closed gym often look heroic in practice and then fall apart in actual competition.
In fact, only the model's scores on out-of-sample data are really useful! (Otherwise, you could just use a lookup table!)
For example:
- Cancer detection: doctors and researchers at MD Anderson (1993) used a neural network for cancer detection and were surprised to find that with longer training (from days to weeks), performance on the training set improved only slightly while performance on the test set dropped significantly.
- Machine learning and computer science researchers often try to make a model perform optimally on known data, and the usual result is overfitting (overfit).
Workaround:
The typical way to address this is to resample. Resampling techniques include the bootstrap, cross-validation, the jackknife, leave-one-out, and so on.
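As a minimal sketch of the cross-validation variant, assuming Python with scikit-learn (the dataset and model below are illustrative choices, not from the original article):

```python
# Score the model only on held-out folds, never on the data it was fit to.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print("Mean out-of-sample accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

Leave-one-out is the special case where the number of folds equals the number of cases.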
2. Relying on a single technique (Rely on One Technique)
Idmer: This mistake is related to mistake 10, so also see the workaround there. Without comparison there is no "good" or "bad"; this is dialectics in action.
"To a child holding a hammer, the whole world looks like a nail." To do the job well, you need a complete toolkit.
Do not simply rely on the results of a single method; at the very least, compare against traditional methods (such as linear regression or linear discriminant analysis).
Research finding: according to the journal Neural Networks, over the past three years only 1/6 of articles managed both of the points above, that is, evaluating on a test set independent of the training sample and comparing against other widely used methods.
Workaround:
Use a range of good tools and methods. (Each additional tool or method can bring up to a 5%~10% improvement.)
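As a hedged illustration of such a comparison in Python with scikit-learn (the models and data are stand-ins chosen for the example, not Elder's):

```python
# Benchmark several methods side by side, including traditional baselines.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
candidates = {
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    "linear discriminant analysis": LinearDiscriminantAnalysis(),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    print("%-32s %.3f" % (name, cross_val_score(model, X, y, cv=5).mean()))
```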
3. Asking the wrong question (Ask the Wrong Question)
Idmer: Classification algorithms generally report classification accuracy as the measure of model quality, but in real projects we hardly ever look at that metric. Why? Because it is not the goal we actually care about.
a) The goal of the project: be sure to lock onto the right target.
For example:
- Fraud detection (the focus is on the positive cases!) (Shannon Labs' analysis of international long-distance calls): do not try to classify fraudulent versus non-fraudulent behavior across calls in general; instead, focus on characterizing what a normal call looks like, and then use that profile to find anomalous calling behavior (a minimal sketch of this idea follows at the end of this section).
b) The goal of the model: let the computer do what you want it to do.
Most researchers become absorbed in driving the model's error down to a minimum, for the mathematical beauty of it. But the computer should be made to improve the business, not merely to maximize the model's computational accuracy.
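A minimal sketch of the "describe normal, then flag deviations" idea in Python; the call durations and the threshold are invented for illustration and are not from the Shannon Labs study:

```python
import numpy as np

rng = np.random.default_rng(0)
normal_calls = rng.normal(loc=5.0, scale=2.0, size=10_000)  # historical call minutes
new_calls = np.array([4.8, 6.1, 42.0, 5.3])                 # incoming traffic

# Characterize "normal" from historical calls only.
mu, sigma = normal_calls.mean(), normal_calls.std()

# Flag anything far outside the normal profile as worth investigating.
z = np.abs((new_calls - mu) / sigma)
print(new_calls[z > 4])  # the 42-minute call stands out
```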
4. Listening only to the data (Listen (Only) to the Data)
Idmer: "Let the data speak" is not wrong; the key is to remember the other half of the proverb: listen to all sides and you will be enlightened, heed only one and you stay in the dark! If data plus tools could solve everything by themselves, what would they still need you for?
4a. Opportunistic data: the data by itself can only help the analyst find striking results; it cannot tell you whether those results are right or wrong (see the sketch below).
4b. Designed experiments: some experimental designs are contaminated by artificial factors, and such results are often unreliable.
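To make 4a concrete, here is a small hedged demonstration (purely synthetic data, not from the article): screen enough random variables and some will look "significant" by chance alone.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)
X = rng.normal(size=(100, 1000))  # 1000 candidate "predictors", all pure noise

# The strongest correlation found by screening is striking, yet meaningless.
corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print("Strongest spurious correlation: %.2f" % corrs.max())  # often above 0.3
```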
5. Using information from the future (Accept Leaks from the Future)
Idmer: It sounds impossible, but in practice this mistake is very easy to make, especially when you face thousands of variables. Rigor, care, and orderliness are basic requirements for data mining practitioners.
Forecasting example: a Chicago bank's one-day interest rate was forecast with a neural network, and the model's accuracy reached 95%. However, the model used that same day's interest rate as an input variable.
Another forecasting example from finance: using a 3-day moving average as a predictor, but with the midpoint of the moving-average window set to today, so that tomorrow's value leaks into the feature.
Workaround:
Look closely at any variable that makes the results behave exceptionally well; such variables may have to be left out, or at least should not be used directly.
Time-stamp the data to avoid misuse.
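A small sketch of the moving-average pitfall above, assuming Python with pandas (the rate series is invented): a centered window peeks at tomorrow's value, while a trailing window uses only the past.

```python
import pandas as pd

rates = pd.Series([3.1, 3.2, 3.4, 3.3, 3.5, 3.6], name="rate")

leaky = rates.rolling(window=3, center=True).mean()  # midpoint = today: uses t+1
safe = rates.shift(1).rolling(window=3).mean()       # uses only data through t-1

print(pd.DataFrame({"rate": rates, "leaky_ma": leaky, "safe_ma": safe}))
```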
6. Discarding cases that should not be ignored (Discount Pesky Cases)
Idmer: Is it "better to be the head of a chicken than the tail of a phoenix", or "the great hermit hides in the city while the lesser hermit hides in the wild"? Different attitudes toward life can lead to equally wonderful lives, and data that looks different may contain equally important value.
Outliers can lead to wrong results (for example, a misplaced decimal point in a price), but they may also be the answer to the question (for example, the ozone hole). So examine these exceptional cases carefully.
The most exciting phrase in research is not "Aha!" but "That's a little strange..."
Inconsistencies in the data can be a clue to the real problem, and digging into them may solve a big business problem.
For example:
- In direct mail marketing, home-address data found to be inconsistent during merging and cleaning may look like dirty data, but it may also point to a new marketing opportunity.
Workaround:
Visualization can help you quickly check a large number of hypotheses against the data.
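As a minimal sketch, assuming Python with matplotlib (the price data, with one suspicious entry, is invented): plot first, delete later.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
prices = np.append(rng.normal(loc=100, scale=5, size=200), 1000.0)
# The last price may be a misplaced decimal point (100.0 entered as 1000)
# or a genuine, important case; a box plot makes it visible either way.

plt.boxplot(prices)
plt.title("Inspect outliers before discarding them")
plt.show()
```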
7. Extrapolating credulously (Extrapolate)
Idmer: Again a dialectical point: things keep developing and changing.
People tend to draw conclusions too readily from limited experience.
Even when counterexamples turn up, people are reluctant to give up their original ideas.
The curse of dimensionality: intuition built in low dimensions is often meaningless in high-dimensional space (see the sketch at the end of this section).
Workaround:
Evolve. There is no final correct conclusion, only conclusions that become more and more accurate.
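One hedged illustration of the curse of dimensionality mentioned above (synthetic uniform data): as the dimension grows, a point's nearest and farthest neighbors become nearly equidistant, so low-dimensional intuition about "closeness" stops applying.

```python
import numpy as np

rng = np.random.default_rng(3)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # skip self-distance
    print("d=%4d  min/max distance ratio: %.2f" % (d, dists.min() / dists.max()))
```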
8. Trying to answer every question (Answer Every Inquiry)
Idmer: This is a bit like what I tell myself when climbing a mountain: "I don't know when I will reach the top, but I know every step brings me closer to the finish."
"Don't know" is a meaningful model output.
The model may not be able to answer every question with 100% accuracy, but it can at least help us estimate the likelihood of a given outcome.
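As a minimal sketch of a model that is allowed to say "don't know", assuming Python with scikit-learn (the 0.9 confidence threshold is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test).max(axis=1)  # confidence of the top class

confident = proba >= 0.9  # answer only when the estimate is decisive
print("Answered %d of %d cases; abstained on the rest"
      % (confident.sum(), len(proba)))
```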
9. Sampling carelessly (Sample Casually)
9a. Sampling down. For example, the MD direct mail company wanted to predict response to its mailings, but found far too many non-responders in the data set (1 million direct-mail customers in total, more than 99% of whom did not respond). The modeler therefore sampled as follows: put every responder into the sample set, then draw systematically from the non-responders, one of every 10, until the sample set reached 100,000 people. The model then produced this rule: everyone living in Ketchikan, Wrangell, and Ward Cove, Alaska will respond to the mailing. That conclusion is obviously wrong. (The problem lies in the sampling method: the original data set had been sorted by ZIP code, and the systematic draw happened to pick up no non-responders at all from those three areas, hence the conclusion.)
Solution: "Shake before you drink!" Shuffle the order of the original data set to ensure that the sampling is random.
9b. Sampling up. For example, in credit scoring, because the proportion of defaulting customers is usually very low, modelers tend to inflate that proportion artificially (for example, by giving defaulting customers 5 times their original weight). Modeling then shows that as the model grows more complex, its accuracy on defaulters climbs higher and higher, while its error rate on normal customers also rises. (The problem lies in how the data set was partitioned: the weights of the defaulting customers had already been inflated before the original data set was split into training and test sets.)
Workaround: split the data set first, and only then raise the weight of the defaulting customers in the training set.
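As a minimal sketch of that fix, assuming Python with scikit-learn (the synthetic data, the 5x weight, and the model are illustrative): partition first, then reweight only inside the training set, so the test set keeps the true class proportions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced "credit" data: ~98% normal customers, ~2% defaulters.
X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)

# 1) Split first, so the test set is untouched by any reweighting.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 2) Upweight the rare (defaulting) class in the training set only.
w = np.where(y_train == 1, 5.0, 1.0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=w)
print("Accuracy on the untouched test set: %.3f" % model.score(X_test, y_test))
```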
10. Believing the best model (Believe the Best Model)
Idmer: As the old saying goes: "There is no best, only better!"
Interpretability is not always necessary. A model that does not seem entirely correct or explainable can still be useful.
The particular variables used by the "best" model can attract more attention than they deserve. (Not being interpretable is sometimes an advantage.)
In general, many variables look very similar to one another, yet the structures of the best models built from them can look wildly different and follow no discernible pattern. Note, however, that structural similarity does not imply functional similarity.
Workaround: bundling multiple models together into an ensemble can lead to better and more stable results.
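A sketch of bundling several models, assuming Python with scikit-learn (the three base learners are illustrative choices, not Elder's):

```python
# A simple soft-voting ensemble: average the probabilities of several models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities across models
)
print("Ensemble CV accuracy: %.3f" % cross_val_score(ensemble, X, y, cv=5).mean())
```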
Reprint address: http://mp.weixin.qq.com/s?__biz=MjM5NDE4MTc2OA==&mid=200096540&idx=1&sn=0ab290940c92ca97f109dd44973094ed&mpshare=1&scene=1&srcid=1011qqdjr60eyzt6gxncfkai#rd
"Several Big Mistakes in Data Mining" (reposted; will be removed at the copyright holder's request)