How to get more out of machine learning data

Last Update:2018-10-31 Source: Internet

Author: User

For deep learning, the right data set and the right model structure are critical. Choosing the wrong data set or the wrong model structure can produce a poorly performing network, or even one that fails to converge. This problem cannot be solved by analyzing the data alone. The answer can only be found by preparing data sets, building models, and running experiments again and again to learn how to make the best use of the data and which model structure to choose.

This article covers some practical knowledge about data sets. After reading it, you will know how to:

Explore possible framings of the problem;

Develop a set of "views" of the input data and test them systematically;

Use ideas from feature selection, feature engineering, and data preparation to gain more perspectives on the problem.

1. Problem framing

Frame the prediction problem in a variety of ways.

The framing of a problem consists of:

Input

Output

Type of problem

For example:

Can I use more or less data as input to the model?

Can I predict something else instead?

Can I turn the problem into regression, classification, or sequence prediction?

The more creative you are, the more comprehensive your thinking will be. In this process you can borrow ideas from other projects, papers, and fields, or simply brainstorm. The earlier blog post "How to Define Your Machine Learning Problem" summarizes some frameworks for reference.
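As a rough illustration of reframing, the sketch below turns one series of numbers into three different problem types. The sales figures and the function names are invented for this example and are not from the original article.

```python
# Hypothetical daily sales figures, invented for illustration only.
sales = [12.0, 15.0, 9.0, 20.0, 22.0, 18.0, 25.0]


def as_regression(series):
    """Regression framing: input is today's value, output is tomorrow's value."""
    return [(series[i], series[i + 1]) for i in range(len(series) - 1)]


def as_classification(series):
    """Classification framing: output is 1 if the next value rises, else 0."""
    return [(series[i], int(series[i + 1] > series[i]))
            for i in range(len(series) - 1)]


def as_sequence(series, window=3):
    """Sequence framing: input is a sliding window of past values."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


regression_pairs = as_regression(sales)
classification_pairs = as_classification(sales)
sequence_pairs = as_sequence(sales, window=3)

print(regression_pairs[0])
print(classification_pairs[0])
print(sequence_pairs[0])
```

Each framing changes the input, the output, or the type of problem, and each one can be tested with its own models.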

2. Collect more data

The more data the better, as long as it is relevant to the outcome being predicted, because for a specific task it is not clear in advance how much data is enough. Data is the currency you spend while developing a model!

Data is typically spent on the following tasks:

Model training

Model evaluation

Model tuning

Model verification
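One common way to budget data across these tasks is to split it into separate parts. The sketch below is a minimal example under assumed proportions (60% training, 20% tuning, 20% final evaluation); the function name and the ratios are illustrative choices, not recommendations from the original article.

```python
import random


def split_dataset(rows, train=0.6, val=0.2, seed=0):
    """Shuffle rows and split them into train / validation / test parts.

    The 60/20/20 default is an assumption made for this sketch.
    """
    rows = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)      # fixed seed keeps the split repeatable
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])


data = list(range(100))                    # stand-in for 100 real samples
train_rows, val_rows, test_rows = split_dataset(data)
print(len(train_rows), len(val_rows), len(test_rows))
```

Here the training part fits the model, the validation part is spent on tuning, and the test part is held back for the final evaluation and verification.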

In addition, a new project has no prior experience to draw on, and everything has to be explored from scratch. In that case, collect enough data to cover all of the experimental stages.

3. Research data

Visualize the data and examine it from every angle.

Viewing the raw data will help;

Viewing summary statistics will help;

Data visualization is useful, and the more plots you make, the more you learn.

You can spend a lot of time reviewing raw data, computing summary statistics, and visualizing the data. This work helps you understand the data better, so that you can select and design an appropriate model.
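As a small illustration of summary statistics, the sketch below uses only the Python standard library. The column of heights is invented for the example, including one deliberately suspicious value.

```python
import statistics

# Hypothetical measurements, invented for illustration; 250.0 is a planted outlier.
heights_cm = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 250.0]


def summarize(values):
    """Return basic summary statistics for one numeric column."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),   # sample standard deviation
    }


summary = summarize(heights_cm)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A mean well above the median, as in this data, is one hint that a few large values are worth inspecting in the raw data.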

4. Training data sample size

Run a sensitivity analysis on small samples of the data to see how much data you actually need. See the blog post "How many samples are needed for training in machine learning." Also, do not assume that more training data is always better; what matters is having the right amount. Therefore, you need to do the following two things:

Design experiments to see how model performance changes with sample size;

Use statistics to understand how the trend changes as the sample size changes.

Together, these two steps give you the model's performance curve.
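The two steps above can be sketched as a small experiment. Everything here is an assumption made for illustration: the synthetic data (a noisy straight line), the single-feature least-squares model, the sample sizes, and the number of repeats.

```python
import random
import statistics


def make_data(n, rng):
    """Synthetic data: y = 2x + 1 plus Gaussian noise (an invented example)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one input feature; returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = model
    return statistics.mean([(slope * x + intercept - y) ** 2
                            for x, y in zip(xs, ys)])


def learning_curve(sizes, repeats=20, seed=0):
    """For each training size, repeat the experiment and summarize test error."""
    rng = random.Random(seed)
    test_xs, test_ys = make_data(500, rng)        # fixed held-out test set
    curve = []
    for n in sizes:
        errors = []
        for _ in range(repeats):
            train_xs, train_ys = make_data(n, rng)
            errors.append(mse(fit_line(train_xs, train_ys), test_xs, test_ys))
        curve.append((n, statistics.mean(errors), statistics.stdev(errors)))
    return curve


curve = learning_curve([5, 20, 80, 320])
for n, mean_err, spread in curve:
    print(f"n={n:4d}  test MSE={mean_err:.3f} +/- {spread:.3f}")
```

Repeating each sample size several times gives both the average error and its spread, so you can see where adding more data stops paying off.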

5. Feature selection

Create and test many different views of the input features, because we do not know in advance which variables help with the prediction problem.

You can make your own guesses;

You can use the advice of a domain expert;

You can draw on suggestions from feature selection methods.

All of these are only guesses and must be verified by experiment.

Design experiments that carefully test and compare the individual feature views, and let the experimental results tell you which features and which views are the most informative. For more on feature selection, see the blog post "Introduction to Feature Selection."
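A comparison of feature views can be sketched as follows. The data is synthetic and invented for the example: a feature named "signal" that tracks the label and a feature named "noise" that does not. The nearest-centroid classifier is only a simple stand-in for whatever model you actually use.

```python
import itertools
import random


def make_rows(n, rng):
    """Invented data: 'signal' depends on the label, 'noise' does not."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        features = {
            "signal": label * 3.0 + rng.gauss(0.0, 1.0),
            "noise": rng.gauss(0.0, 1.0),
        }
        rows.append((features, label))
    return rows


def centroid_accuracy(train, test, view):
    """Accuracy of a nearest-centroid classifier using only the features in `view`."""
    centroids = {}
    for label in (0, 1):
        members = [feats for feats, y in train if y == label]
        centroids[label] = [sum(m[f] for m in members) / len(members)
                            for f in view]
    correct = 0
    for feats, y in test:
        point = [feats[f] for f in view]
        dist = {label: sum((p - c) ** 2 for p, c in zip(point, centre))
                for label, centre in centroids.items()}
        correct += int(min(dist, key=dist.get) == y)
    return correct / len(test)


rng = random.Random(0)
train, test = make_rows(300, rng), make_rows(200, rng)

feature_names = ["signal", "noise"]
# Every non-empty subset of the features is one "view" to test.
views = [v for r in range(1, len(feature_names) + 1)
         for v in itertools.combinations(feature_names, r)]

scores = {view: centroid_accuracy(train, test, view) for view in views}
for view, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(view, f"{acc:.2f}")
```

Testing every subset only works for a handful of features; with many features, guesses, expert advice, and feature selection methods narrow down which views are worth testing.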

6. Feature engineering

Use feature engineering to create additional features and views of the predictive modeling problem.

Data that often benefits from this includes:

Date/time;

Transactions;

Descriptions.

Decompose such data into simpler additional features such as counts, flags, or other elements. Simplify the modeling process as much as possible. For more on feature engineering, see the blog post "Discovering Feature Engineering, How to Design Features, and Using It".
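As a minimal sketch of this kind of decomposition, the code below breaks a timestamp and a free-text description into simple numbers and flags. The timestamp format, the 9:00 to 17:00 business-hours rule, and the feature names are all assumptions made for the example.

```python
from datetime import datetime


def expand_timestamp(text):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into simpler features and flags."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.weekday(),                   # Monday = 0, Sunday = 6
        "hour": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
        "is_business_hours": int(9 <= ts.hour < 17),   # assumed 9:00-17:00 rule
    }


def expand_description(text):
    """Turn free text into a count and a flag."""
    words = text.split()
    return {
        "word_count": len(words),
        "has_digit": int(any(ch.isdigit() for ch in text)),
    }


time_features = expand_timestamp("2018-10-31 14:30:00")
text_features = expand_description("refund for order 42")
print(time_features)
print(text_features)
```

Each derived feature is a small, explicit number that a model can use directly, rather than an opaque string.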

7. Data preparation

Preprocess the data in every way you can think of to meet the expectations of the algorithm. Feature selection, feature engineering, and creating additional views of the input features are all forms of preprocessing. Other common preprocessing operations include:

Normalizing the input features;

Standardizing the input features;

Smoothing the input features.

Apply every data preprocessing method you can think of, build a new perspective on the problem each time, and test it with one or more models to see how it performs. The goal is to discover the view of the data that best exposes the unknown underlying structure of the mapping problem to the learning algorithm.
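The three operations listed above can be sketched in a few lines. The definitions used here are common ones but still assumptions of this example: min-max scaling for normalization, zero mean and unit population standard deviation for standardization, and a trailing moving average for smoothing.

```python
import statistics


def normalize(values):
    """Min-max rescale to the range [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def smooth(values, window=3):
    """Trailing moving average; early entries average the values seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


raw = [2.0, 4.0, 6.0, 8.0, 10.0]       # invented input feature
normalized = normalize(raw)
standardized = standardize(raw)
smoothed = smooth(raw, window=3)

print(normalized)
print(smoothed)
```

Each transformed version of the feature is a separate view of the data that can be tested against the same models.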

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
