10 common mistakes in Data Mining

Last Update:2018-12-07 Source: Internet

Author: User

0. Lack of data (lack data)
1. Focusing too much on training (focus on training)
2. Relying on only one technique (rely on one technique)
3. Asking the wrong question (ask the wrong question)
4. Relying only on the data to speak (listen (only) to the data)
5. Using information from the future (accept leaks from the future)
6. Discarding cases that should not be ignored (discount pesky cases)
7. Trusting predictions too readily (extrapolate)
8. Trying to answer every question (answer every inquiry)
9. Sampling casually (sample casually)
10. Believing in the best model (believe the best model)

0. Lack of data (lack data)

For classification or estimation problems, accurately labeled cases are often in short supply.

For example:
- Fraud detection: among millions of transactions there may be only a handful of fraudulent ones, and many fraudulent transactions are not correctly labeled. Correcting the labels takes a great deal of effort before modeling can begin.
- Credit scoring: potential high-risk customers must be tracked over a long period (for example, two years) to accumulate enough scoring samples.

1. Focusing too much on training (focus on training)
Idmer: This is like sports, where training puts more and more emphasis on real match practice, because athletes who only train behind closed doors often look heroic in training and fall apart in competition.

In fact, only the model's scores on out-of-sample data are truly useful! (Otherwise, you might as well use a lookup table!)

For example:
- Cancer detection: in 1993, doctors and researchers at MD Anderson used neural networks for cancer detection. They were surprised to find that as training time grew (from a few days to several weeks), performance on the training set improved only slightly, while performance on the test set dropped significantly.
- Machine learning and computer science researchers often try to make a model perform best on known data, which usually leads to overfitting.

Solution:
The typical way to address this problem is resampling. Resampling techniques include the bootstrap, cross-validation, the jackknife, leave-one-out, and so on.
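As a rough illustration of cross-validation, the sketch below scores a 1-nearest-neighbor model (which simply memorizes its training data, much like a lookup table) both in-sample and with 5-fold cross-validation. The toy data, the 20% label noise, and all function names are assumptions made for this example, not part of the original article.

```python
import random

def nearest_neighbor_predict(train, x):
    """Predict the label of x by copying its closest training point (1-NN)."""
    closest = min(train, key=lambda point: abs(point[0] - x))
    return closest[1]

def accuracy(train, test):
    """Fraction of test points that a 1-NN model built on train gets right."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

def k_fold_accuracy(data, k=5, seed=0):
    """Average out-of-sample accuracy over k folds of shuffled data."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k

# Hypothetical toy data: label is 1 when x > 0.5, with about 20% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.2:
        label = 1 - label
    data.append((x, label))

train_acc = accuracy(data, data)     # scored on the data it memorized
cv_acc = k_fold_accuracy(data, k=5)  # scored on data it never saw
print(f"in-sample accuracy:       {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

The in-sample score is perfect only because every point is its own nearest neighbor; the cross-validated score is the one that says something about new data.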

2. Relying on only one technique (rely on one technique)
Idmer: This mistake is closely related to mistake 10. Without comparison there is no such thing as good or bad; the dialectical way of thinking shows clearly here.

"When a child holds a hammer, the whole world looks like a nail." You need a complete toolbox to do your work well.
Do not simply trust the results of an analysis carried out with a single method. At the very least, compare it with a traditional method (such as linear regression or linear discriminant analysis).

Research finding: according to statistics from the journal Neural Networks, over the past 3 years only 1/6 of the articles did both of these things, that is, tested on a test set independent of the training sample and compared the results with other widely used methods.

Solution:
Use a range of good tools and methods. (Each tool or method may bring an improvement of up to 5% to 10%.)
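To make the "compare with a traditional method" advice concrete, here is a minimal sketch that scores a candidate model against a plain linear regression baseline on held-out data. The noisy straight-line data and the nearest-neighbor "candidate" are hypothetical choices for illustration; on other data the candidate could well win, which is exactly what the comparison is meant to reveal.

```python
import random
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def nearest_neighbor(points, x):
    """Predict y by copying the training point whose x is closest."""
    return min(points, key=lambda p: abs(p[0] - x))[1]

def mse(predict, test):
    """Mean squared error of a prediction function on held-out points."""
    return mean((predict(x) - y) ** 2 for x, y in test)

# Hypothetical data: a noisy straight line, split into train and test halves.
rng = random.Random(3)
points = []
for _ in range(600):
    x = rng.uniform(0, 10)
    points.append((x, 2 * x + 1 + rng.gauss(0, 1)))
train, test = points[:300], points[300:]

slope, intercept = fit_line(train)
baseline_error = mse(lambda x: slope * x + intercept, test)
candidate_error = mse(lambda x: nearest_neighbor(train, x), test)
print(f"linear regression baseline MSE: {baseline_error:.2f}")
print(f"nearest-neighbor candidate MSE: {candidate_error:.2f}")
```

Both models are scored on the same test half, which is independent of the training half, so the two numbers are directly comparable.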

3. Asking the wrong question (ask the wrong question)
Idmer: Classification algorithms usually report classification accuracy as the standard for measuring model quality, but in real projects we hardly ever look at this metric. Why? Because it is not our goal.

A) Project goals: be sure to lock onto the right goal

For example:
Fraud detection (focus on positive examples!) (Shannon Laboratory's analysis of international long-distance calls): do not try to classify fraud versus non-fraud across calls in general. Instead, focus on describing what normal calls look like, and then detect the calls that deviate from that description.

B) Model goals: make the computer do what you want it to do

Most researchers indulge in getting the model to converge on the smallest possible error, for the sake of mathematical elegance. But what the computer should be doing is improving the business, not merely maximizing the accuracy of the model's calculations.
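A small worked example may help show why raw classification accuracy can be the wrong target. The numbers below are invented for illustration: 1,000 transactions of which 10 are fraudulent, and two hypothetical models.

```python
def evaluate(actual, predicted):
    """Return (accuracy, share of the positive class that was caught)."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    positives = sum(actual)
    return correct / len(actual), caught / positives

# Hypothetical data: 1,000 transactions, the first 10 are fraud (label 1).
actual = [1] * 10 + [0] * 990

# Model A never raises an alert.
model_a = [0] * 1000

# Model B flags 50 transactions: 8 of the real frauds plus 42 false alarms.
model_b = [1] * 8 + [0] * 2 + [1] * 42 + [0] * 948

acc_a, caught_a = evaluate(actual, model_a)
acc_b, caught_b = evaluate(actual, model_b)
print(f"A: accuracy={acc_a:.3f}, fraud caught={caught_a:.0%}")
print(f"B: accuracy={acc_b:.3f}, fraud caught={caught_b:.0%}")
```

Model A wins on accuracy yet catches no fraud at all, while model B has lower accuracy and catches most of it. Which one is "better" depends on the business goal, not on the accuracy figure.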

4. Relying only on the data to speak (listen (only) to the data)
Idmer: There is nothing wrong with "letting the data speak". The key is to remember another saying: listen to all sides and you will be enlightened; heed only one side and you will be left in the dark! If data plus tools could solve every problem, what would people be needed for?

4a. Opportunistic data: data by itself can only help analysts find which results are significant; it cannot tell you whether a result is right or wrong.
4b. Designed experiments: some experimental designs are contaminated by human factors, and the results of such experiments are often untrustworthy as well.

5. Using information from the future (accept leaks from the future)
Idmer: This seems impossible, but it is a very easy mistake to make in practice, especially when you are facing thousands of variables. Being careful, cautious, and well organized is a basic requirement for data mining practitioners.

Forecasting example: a neural network model built to forecast the interest rate of Bank of Chicago on a given day reached 95% accuracy. However, the interest rate for that very day had been used as an input variable.
Forecasting example from the financial industry: a 3-day moving average was used for forecasting, but the midpoint of the moving average was set at today, so the window included tomorrow's value.

Solution:
Carefully examine the variables that make the results look abnormally good. These variables may be ones that should not be used at all, or should not be used directly.
Add timestamps to the data to avoid misuse.
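The moving-average leak can be sketched in a few lines. The rates below are made-up values, and the two helper functions are hypothetical names; the point is only that a window centered on today depends on tomorrow, while a trailing window does not.

```python
def trailing_average(series, t, window=3):
    """Mean of the `window` values ending at index t (past and today only)."""
    return sum(series[t - window + 1 : t + 1]) / window

def centered_average(series, t, window=3):
    """Mean of the `window` values centered on index t (reaches into the future)."""
    half = window // 2
    return sum(series[t - half : t + half + 1]) / window

# Hypothetical daily rates; index 3 is "today", index 4 is "tomorrow".
rates = [2.0, 2.1, 2.3, 2.2, 2.6]
today = 3

# Same history, but with a different value for tomorrow.
alternative = rates[: today + 1] + [1.0]

# The trailing feature is identical in both cases: it cannot know tomorrow.
print(trailing_average(rates, today), trailing_average(alternative, today))

# The centered feature changes with tomorrow's value: it leaks the future.
print(centered_average(rates, today), centered_average(alternative, today))
```

A quick check of this kind (change a future value and see whether a feature for today moves) is one way to use timestamps to catch leaks before modeling.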

6. Discarding cases that should not be ignored (discount pesky cases)
Idmer: Is it "better to be the head of a chicken than the tail of a phoenix", or "the great hermit hides in the city, the lesser hermit hides in the wild"? Different attitudes can lead to equally wonderful lives, and different data may hold equally important value.

Outliers may lead to wrong results (such as a misplaced decimal point in a price), but they may also be the answer to the problem (such as the ozone hole). So these anomalies need to be checked carefully.
The most exciting words in research are not "Aha!" but "That's a bit strange..."
Inconsistencies in the data may be a clue to solving the problem, and digging deeper may solve a big business problem.

For example:
In direct mail marketing, inconsistencies discovered while merging and cleaning household addresses may turn out to be new marketing opportunities.

Solution:
Visualization can help you check whether a large number of assumptions hold.

7. Trusting predictions too readily (extrapolate)
Idmer: This is again the dialectical view: things are constantly developing and changing.

People often jump to conclusions when they have little experience.
Even when counterexamples are found, people are reluctant to give up their original ideas.
Curse of dimensionality: intuition built in low dimensions is often meaningless in a high-dimensional space.

Solution:
Think in terms of evolution. There are no correct conclusions, only increasingly accurate ones.

8. Trying to answer every question (answer every inquiry)
Idmer: This is a bit like what I tell myself when climbing a mountain: "I don't know when I will reach the top, but I know every step takes me closer to it."

"Unknown" is a meaningful model result.
A model may not be able to answer a question with 100% accuracy, but it can at least help us estimate the likelihood of a given outcome.
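One simple way to let a model say "unknown" is to abstain whenever its estimated probability is not decisive. The sketch below assumes the model outputs a probability of "yes"; the function name and the 0.8 cutoff are arbitrary choices for illustration.

```python
def answer(probability, confidence=0.8):
    """Map a predicted probability of "yes" to a label, or abstain."""
    if probability >= confidence:
        return "yes"
    if probability <= 1 - confidence:
        return "no"
    return "unknown"

# Hypothetical model scores for three cases.
for score in (0.95, 0.55, 0.05):
    print(score, "->", answer(score))
```

The probability itself remains available for the abstained cases, so the estimate of how likely an outcome is does not get thrown away.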

9. Sampling casually (sample casually)

9a. Lowering the sampling level (down-sampling). For example, an MD direct mail company ran a response prediction analysis but found that the proportion of non-responders in the dataset was too high (1 million direct mail customers in total, of whom more than 99% did not respond). So the modeler sampled as follows: put all responders into the sample set, then sample the non-responders systematically, taking every 10th person, until the sample set reached 100,000 people. But the model came up with the following rule: everyone who lives in Ketchikan, Wrangell, or Ward Cove, Alaska will respond to the marketing. This conclusion is obviously wrong. (The problem lies in the sampling method: the original dataset was sorted by zip code, and no non-responders from these three places were drawn into the sample set, which led to this conclusion.)

Solution: "Shake before drinking!" First shuffle the original dataset so that the sampling is truly random.
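The effect of sampling a sorted file until a quota is full can be reproduced at small scale. In this sketch the regions, counts, step, and quota are all hypothetical stand-ins for the zip-code-sorted customer file described above.

```python
import random

def systematic_sample(rows, step, limit):
    """Take every `step`-th row, stopping once `limit` rows are drawn."""
    return rows[::step][:limit]

# Hypothetical non-responders: 10 regions with 100 customers each,
# stored sorted by region (a stand-in for "sorted by zip code").
non_responders = [(region, customer)
                  for region in range(10)
                  for customer in range(100)]

# Sampling the sorted file fills the quota before the last regions are reached.
biased = systematic_sample(non_responders, step=5, limit=100)
biased_regions = {region for region, _ in biased}

# "Shake before drinking": shuffle a copy first, then sample the same way.
shuffled = non_responders[:]
random.Random(0).shuffle(shuffled)
fair = systematic_sample(shuffled, step=5, limit=100)
fair_regions = {region for region, _ in fair}

print("regions in sample from sorted file:  ", sorted(biased_regions))
print("regions in sample from shuffled file:", sorted(fair_regions))
```

With the sorted file, the later regions contribute no non-responders at all, so any responder from those regions looks like proof that everyone there responds.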

9b. Raising the sampling level (over-sampling). For example, in credit scoring the percentage of defaulting customers is usually very low, so their proportion is often raised artificially during modeling (for example, by weighting these defaulting customers 5 times). During modeling it was found that as the model became more complex, the accuracy of identifying defaulting customers kept rising, but the false positive rate on normal customers rose as well. (The problem lies in how the dataset was split: the weight of the defaulting customers had already been increased in the original dataset before it was divided into a training set and a test set.)

Solution: split the dataset first, and then increase the weight of defaulting customers in the training set only.
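The order of the two steps matters because copies of the same customer can end up on both sides of the split. The sketch below uses made-up customer ids and implements the 5x weighting as simple duplication, which is one possible way (an assumption here) to raise the weight of defaults.

```python
import random

def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and cut it into (train, test)."""
    rows = rows[:]
    rng.shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[: len(rows) - n_test], rows[len(rows) - n_test :]

def oversample_defaults(rows, times):
    """Repeat each defaulting customer so it appears `times` times in total."""
    return [row for row in rows for _ in range(times if row[1] == 1 else 1)]

def shared_ids(train, test):
    """Customer ids that appear in both sets."""
    return {cid for cid, _ in train} & {cid for cid, _ in test}

# Hypothetical customers: (id, defaulted), with 20 defaults among 1,000.
customers = [(cid, 1 if cid < 20 else 0) for cid in range(1000)]

# Wrong order: oversample first, then split. Copies land on both sides.
train_bad, test_bad = split(oversample_defaults(customers, 5), 0.3, random.Random(1))

# Right order: split first, then oversample the training set only.
train, test = split(customers, 0.3, random.Random(1))
train = oversample_defaults(train, 5)

print("customers in both sets, oversample then split:", len(shared_ids(train_bad, test_bad)))
print("customers in both sets, split then oversample:", len(shared_ids(train, test)))
```

When the same defaulting customer sits in both sets, a more complex model can memorize that customer and appear to improve on the test set without generalizing.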

10. Believing in the best model (believe the best model)
Idmer: As the old saying goes, "There is no best, only better!"

Interpretability is not always necessary. A model that does not look entirely correct or interpretable can sometimes still be useful.
Some of the variables used in the "best" model can draw too much of people's attention. (Lack of interpretability is sometimes an advantage.)
In general, many variables look very similar to one another, while the structures of the best models look wildly different and follow no discernible pattern. Note, however, that similar structure does not imply similar function.

Solution: combining multiple models into an ensemble may bring better and more stable results.
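As a minimal sketch of why combining models can help, the simulation below averages five hypothetical models whose errors are independent. That independence is an assumption that real models satisfy only partly, so the size of the gain here is optimistic; the models are simulated as the true value plus random noise rather than trained.

```python
import random
from statistics import mean

def mse(predictions, truth):
    """Mean squared error between two equal-length sequences."""
    return mean((p - t) ** 2 for p, t in zip(predictions, truth))

rng = random.Random(7)
truth = [rng.uniform(0, 10) for _ in range(500)]

# Five hypothetical models, each simulated as the truth plus its own noise.
models = [[t + rng.gauss(0, 1) for t in truth] for _ in range(5)]

# The ensemble prediction is the plain average of the five models.
ensemble = [mean(column) for column in zip(*models)]

individual_errors = [mse(m, truth) for m in models]
ensemble_error = mse(ensemble, truth)
print("individual MSEs:", [round(e, 2) for e in individual_errors])
print("ensemble MSE:   ", round(ensemble_error, 2))
```

Averaging never does worse than the average of the individual squared errors, and when the errors are largely independent it tends to beat even the best single model.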

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
