Common pitfalls in machine learning projects

Source: Internet
Author: User


http://blog.jobbole.com/86131/

2015/04/22 · IT technology · Machine learning

This article was translated by ruan.answer for Bole Online and proofread by Daetalus. Do not reprint without permission.
English source: Machine Learning Mastery.

In a recent talk, Ben Hamner described the common pitfalls that he and his colleagues have seen in machine learning projects during Kaggle competitions.

The talk, given at Strata in February 2014, is titled "Machine Learning Gremlins".

In this article, we'll look at some of the common pitfalls from Ben's talk: what they are and how to avoid falling into them.

The process of machine learning

Early in the talk, Ben showed a general process for working through a machine learning problem.

The machine learning process, from Ben Hamner's "Machine Learning Gremlins"

This process includes the following 9 steps:

    1. Start with a business problem
    2. Source data
    3. Split data
    4. Select an evaluation metric
    5. Perform feature extraction
    6. Train the model
    7. Feature selection
    8. Model selection
    9. Production system

Ben emphasizes that the process is iterative rather than linear.

He also talked about what can go wrong at each step of the process; any one of these errors can keep the whole machine learning effort from achieving the desired result.
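
Steps 3, 4, 6 and 8 of the process listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the talk: the toy data, the two candidate models, and every function name are hypothetical.

```python
import random

def make_data(n, seed=0):
    """Hypothetical toy data: one numeric feature, label 1 tends to have smaller values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 3.0 if label == 1 else 9.0
        rows.append((rng.gauss(center, 1.5), label))
    return rows

def split(rows, train_fraction=0.7, seed=0):
    """Step 3: shuffle a copy once, then cut it into training and validation parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def accuracy(model, rows):
    """Step 4: the evaluation metric, here the fraction of correct predictions."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def train_majority(train):
    """Step 6, candidate A: always predict the most common training label."""
    ones = sum(y for _, y in train)
    majority = 1 if ones * 2 >= len(train) else 0
    return lambda x: majority

def train_threshold(train):
    """Step 6, candidate B: cut at the midpoint between the two class means."""
    mean = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        mean[label] = sum(values) / len(values)
    midpoint = (mean[0] + mean[1]) / 2
    below_is_one = mean[1] < mean[0]
    return lambda x: int((x < midpoint) == below_is_one)

train, valid = split(make_data(200))
candidates = {
    "majority": train_majority(train),
    "threshold": train_threshold(train),
}
# Step 8: pick the candidate that scores best on data it was not trained on.
scores = {name: accuracy(model, valid) for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The point of the sketch is the order of operations: the split and the metric are fixed before any model is trained, and model selection only ever looks at the validation rows.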

Identifying dogs and cats

Ben presented a case study on building an automatic "cat door" that opens for the cat and stays closed for the dog. It is an instructive example because it touches on a number of key issues in working with data.

Identifying dogs and cats, from Ben Hamner's "Machine Learning Gremlins"

Sample size

The first takeaway from this example is that Ben studied model accuracy against the size of the data sample and showed that more samples went together with better accuracy.

He kept adding training data until the model's accuracy leveled off. The example gives a good sense of how sensitive a system is to sample size, so that you can adjust accordingly.
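
The idea of adding training data until accuracy levels off can be tried on synthetic data. The sketch below is a stand-in, not Ben's experiment: it assumes two Gaussian classes and a simple nearest-centroid classifier, and all names in it are hypothetical.

```python
import random

def make_samples(n, rng):
    """Hypothetical 2-D toy data: class 0 around (0, 0), class 1 around (2, 2)."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        samples.append((point, label))
    return samples

def fit_centroids(train):
    """'Training' for a nearest-centroid classifier: average the points of each class."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in train:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    # A class with no training points gets no centroid and is never predicted.
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items() if s[2]}

def predict(centroids, point):
    """Assign the class whose centroid is closest to the point."""
    def squared_distance(label):
        cx, cy = centroids[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(centroids, key=squared_distance)

def accuracy(centroids, rows):
    return sum(predict(centroids, p) == label for p, label in rows) / len(rows)

def learning_curve(sizes, repeats=30, seed=0):
    """Mean test accuracy for each training-set size, averaged over several draws."""
    rng = random.Random(seed)
    test = make_samples(2000, rng)  # one fixed held-out set for every size
    curve = []
    for n in sizes:
        scores = [accuracy(fit_centroids(make_samples(n, rng)), test)
                  for _ in range(repeats)]
        curve.append((n, sum(scores) / repeats))
    return curve

sizes = [2, 5, 10, 50, 200, 1000]
curve = learning_curve(sizes)
for n, acc in curve:
    print(f"{n:>5} training samples -> mean test accuracy {acc:.3f}")
```

On this toy problem the accuracy climbs quickly at small sizes and then flattens; the size at which the curve flattens is the useful number, because beyond it more data of the same kind buys little.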

The wrong problem

The second takeaway is that the system failed: it let in every cat in the neighborhood.

The example highlights the importance of understanding the constraints of the problem that needs to be solved, rather than focusing on the problem you want to solve.

Pitfalls in machine learning projects

Ben went on to discuss 4 common pitfalls in working on machine learning problems.

Although these problems are very common, he points out that they are relatively easy to identify and fix.

Overfitting, from Ben Hamner's "Machine Learning Gremlins"

    • Data leakage: Using data in the model that a production system would not have access to. This is particularly common in time series problems. It can also happen with data such as system IDs, which may indicate the class label. Run the model and look carefully at the features that contribute most to its success. Sanity-check them and consider whether they make sense. (See the referenced paper "Leakage in Data Mining".)
    • Overfitting: Modeling the training data too closely, so that the model also fits the noise in the data. The result is a poorer ability to generalize, and the problem becomes worse in higher dimensions and with more complex class boundaries.
    • Data sampling and splitting: Related to data leakage. You need to be very careful that the training, test, and validation sets are truly independent samples. Time series problems take a lot of thought and work to make sure that the data can be replayed to the system in chronological order and the model's accuracy validated.
    • Data quality: Check the consistency of your data. Ben gave an example using flight takeoff and landing data, in which a lot of inconsistent, duplicate, and erroneous records had to be identified and handled explicitly. Such data can directly hurt the modeling and the model's ability to generalize.
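
Two of the points above, consistency checks and chronological splitting, are easy to illustrate together. The following is a minimal sketch under stated assumptions: the flight records, their field names, and the helper functions are all invented for illustration and do not come from Ben's data.

```python
from datetime import datetime

# Hypothetical flight records; the field names are assumptions made for this sketch.
flights = [
    {"id": "F1", "takeoff": "2014-01-03 08:00", "landing": "2014-01-03 10:30"},
    {"id": "F2", "takeoff": "2014-01-01 09:00", "landing": "2014-01-01 11:00"},
    {"id": "F3", "takeoff": "2014-01-04 14:00", "landing": "2014-01-04 13:00"},  # inconsistent
    {"id": "F2", "takeoff": "2014-01-01 09:00", "landing": "2014-01-01 11:00"},  # duplicate
    {"id": "F4", "takeoff": "2014-01-02 07:00", "landing": "2014-01-02 09:15"},
    {"id": "F5", "takeoff": "2014-01-05 06:00", "landing": "2014-01-05 08:00"},
]

def parse(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M")

def clean(records):
    """Separate usable records from duplicates and records that land before takeoff."""
    seen, kept, rejected = set(), [], []
    for record in records:
        key = (record["id"], record["takeoff"], record["landing"])
        if key in seen:
            rejected.append((record, "duplicate"))
        elif parse(record["landing"]) <= parse(record["takeoff"]):
            rejected.append((record, "landing not after takeoff"))
        else:
            kept.append(record)
        seen.add(key)
    return kept, rejected

def chronological_split(records, train_fraction=0.75):
    """Sort by time and cut once, so every test record is later than every training record."""
    ordered = sorted(records, key=lambda r: parse(r["takeoff"]))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

kept, rejected = clean(flights)
train, test = chronological_split(kept)

print("rejected:", [(r["id"], reason) for r, reason in rejected])
print("train:", [r["id"] for r in train])  # F2, F4, F1
print("test: ", [r["id"] for r in test])   # F5
```

Rejected rows are returned with a reason rather than silently dropped, which keeps the handling explicit. Splitting by time instead of shuffling means the model is never evaluated on records older than the ones it was trained on, which is one simple guard against leakage in time series problems.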

Summary

Ben's "Machine Learning Gremlins" is a quick and practical talk.

You'll get a useful crash course in common machine learning pitfalls, and the techniques for avoiding them are easy to apply when working with data.
