For deep learning, choosing the right data set and the right model structure is critical. The wrong choice can produce a poorly performing network, or one that fails to converge at all. This problem cannot be solved by analyzing the data alone. You have to build data sets, construct models, and run experiments again and again to find out how to make the best use of your data and which model structure to choose.
This article covers some practical knowledge about data sets. By reading it you will learn the following three things:
Explore possible framings of the problem;
Develop a set of "views" of the input data and test them systematically;
Use ideas from feature selection, feature engineering, and data preparation to generate more perspectives on the problem.
1. Problem framing
Frame your model's prediction problem in a variety of ways.
A framing of the problem consists of:
Input
Output
Type of problem
For example, ask:
Can I use more or less data as input to the model?
Can I predict something else instead?
Can I recast the problem as regression, classification, or sequence prediction?
The more creative you are, the more comprehensively you will think about the problem. In this process you can borrow ideas from other projects, papers, and fields, or simply brainstorm. In the earlier blog post "How to Define Your Machine Learning Problem," I summarized some framings for readers' reference.
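As a minimal sketch of one such reframing, the snippet below bins a hypothetical continuous target so that a regression problem can be tested as three-class classification instead (the target values and the bin count are assumptions chosen purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical continuous target, e.g. prices (illustrative, not from the article)
y = pd.Series(np.random.default_rng(0).lognormal(mean=12, sigma=0.5, size=1000))

# Reframe regression as three-class classification by binning the target
y_class = pd.qcut(y, q=3, labels=["low", "medium", "high"])
print(y_class.value_counts())
```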
2. Collect more data
The more data the better, as long as it is relevant to what you are predicting, because for a specific task it is rarely clear in advance how much data is enough. Data is the currency spent during the development of the model!
Data is typically spent on the following tasks:
Model training;
Model evaluation;
Model tuning;
Model validation.
In addition, a new project comes with no prior experience to draw on; everything has to be discovered through experimentation. Collect enough data up front so that there is plenty left for every experimental stage.
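As a sketch of how this currency gets spent, the snippet below splits a hypothetical data set into training, validation, and test portions with scikit-learn; the 60/20/20 ratios are an assumption chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data set; shapes chosen purely for illustration
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% as a test set for final model validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into training (fitting) and validation
# (evaluation and tuning) sets: 0.25 of 80% = 20% of the total
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```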
3. Study the data
Visualize the data and examine it from every angle you can think of.
Viewing the raw data helps;
Viewing summary statistics helps;
Data visualization helps most of all: the more plots you make, the more you learn.
It is worth spending a lot of time on reviewing raw data, computing summary statistics, and visualizing. These tasks help you understand the data well enough to select and design an appropriate model.
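A minimal sketch of these three activities with pandas and matplotlib, assuming a hypothetical file data.csv (the path and columns are assumptions for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV file; path and columns are assumptions for illustration
df = pd.read_csv("data.csv")

print(df.head())      # view the raw data
print(df.describe())  # summary statistics for the numeric columns

# Visualize the distribution of every numeric feature
df.hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.show()
```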
4. Training data sample size
Run a sensitivity analysis on small samples of your data to see how much data you actually need; see the blog post "How Many Samples Are Needed for Training in Machine Learning." Do not assume that more training data is always better: the right amount beats the largest amount. To find it, do the following two things:
Design experiments to see how model performance changes with sample size
Use statistics to understand how trends change as the sample size changes
Together, these two steps give you the model's performance curve: performance as a function of sample size.
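One way to run such an experiment is scikit-learn's learning_curve, sketched below on a built-in data set (the estimator and the sample-size grid are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Measure cross-validated performance at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:4d} samples -> accuracy {score:.3f}")
```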
5. Feature selection
Create and test many different views of the input features, because we do not know in advance which variables will help the model make predictions.
You can make your own conjectures;
You can take the advice of a domain expert;
You can draw on the suggestions of feature selection methods;
All of the above are guesses and must be verified experimentally.
Design experiments to carefully test and compare the individual feature views, and let the experimental results tell you which features and views are most informative. For more on feature selection, see the blog post "Introduction to Feature Selection."
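As one example of a feature selection method whose suggestions you might test, the sketch below uses scikit-learn's SelectKBest with an ANOVA F-score; the data set and k=5 are assumptions chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Score every feature against the target and keep the five strongest
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)

for name in data.feature_names[selector.get_support()]:
    print(name)
```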
6. Feature engineering
Use feature engineering to create additional features and views of your predictive modeling problem.
Examples include:
Dates/times;
Transactions;
Descriptions.
Decompose such data into simpler component features, such as counts, flags, or other elements, and simplify the modeling problem as much as possible. For more on feature engineering, see the blog post "Discovering Feature Engineering, How to Design Features, and Using It."
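A minimal sketch of such a decomposition, using pandas on hypothetical transaction records (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical transaction records with a date/time column (illustrative)
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-15 09:30", "2023-06-02 18:45", "2023-12-24 23:10"]),
    "amount": [25.00, 130.50, 7.99],
})

# Decompose the date/time into simpler component features
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)  # flag

print(df)
```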
7. Data preparation
Preprocess the data in every way you can think of to meet the requirements of your algorithms. Like feature selection and feature engineering, preprocessing creates additional views of the input features. Common preprocessing operations include:
Normalizing the input features;
Standardizing the input features;
Smoothing the input features.
Apply every data preprocessing method you can think of, build a new view of the problem from it, and test that view with one or more models to see how it performs. The goal is to discover the view of the data that best exposes the unknown underlying structure of the mapping problem to the learning algorithms.
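A minimal sketch of the first two operations with scikit-learn (the input matrix is an assumption chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical raw feature matrix; values chosen for illustration
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])

# Normalization: rescale each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Standardization: rescale each feature to zero mean and unit variance
print(StandardScaler().fit_transform(X))
```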