Work flow and model tuning

Last Update:2016-06-16 Source: Internet

Author: User

Tags svm


July Online (七月在线) April Machine Learning Algorithms Class: Course Notes No. 7

Objective

Machine learning in practice is a long, multi-step process. The previous article covered feature processing, which is important and time-consuming, yet it is only the preparatory stage of machine learning. After feature engineering come the modeling steps: selecting a model, cross-validating, and finding the best parameters. Once the model is built it still has to be tuned; model tuning is a necessary part of real production work and a matter of continuous improvement.
Using a small dataset as an example, this article walks through the machine learning workflow in a real project: how to analyze the model state, analyze the weights, analyze bad cases, and do model fusion. Time to get the big picture!

1. Preliminary Workflow

1.1 Data

Data cleansing: Discard untrustworthy samples, and do not use fields in which most values are missing.
Data sampling: Use downsampling or upsampling to keep the classes balanced.
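As a minimal sketch of the sampling step, the two helpers below balance a toy dataset by random downsampling and random upsampling. The function names and the toy data are illustrative assumptions, not part of the course material, and real projects often use more careful resampling schemes.

```python
import random
from collections import defaultdict


def _group_by_label(samples, labels):
    """Collect samples into one list per class label."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def downsample(samples, labels, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(items) for items in groups.values())
    balanced = [(s, label) for label, items in groups.items()
                for s in rng.sample(items, target)]
    rng.shuffle(balanced)
    return balanced


def upsample(samples, labels, seed=0):
    """Repeat randomly chosen samples of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((s, label) for s in items + extra)
    rng.shuffle(balanced)
    return balanced


samples = list(range(10))
labels = [0] * 8 + [1] * 2  # imbalanced toy data: 8 negatives, 2 positives
down = downsample(samples, labels)
up = upsample(samples, labels)
print(len(down), len(up))
```

Downsampling discards data from the majority class, while upsampling repeats minority samples, so which one fits depends on how much data is available.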

1.2 Feature Engineering

The previous note focused on feature processing and feature selection in feature engineering. Feature processing covers numerical, categorical, temporal, textual, statistical, and combinatorial features; feature selection covers filter, wrapper, and embedded methods. They are not repeated here; the previous note summarizes the process and methods of feature processing.
　　
　　

1.3 Model Selection

With the training data ready, a machine learning model can be selected based on the features. Model selection can be understood in two ways: first, which model to choose; second, once the model is fixed, how to set its parameters.

1) The first meaning: which model to choose?
A common question is: "I have prepared the data and plan to do classification or continuous value prediction. Which model is better?" In fact, no model is omnipotent; each has its own application scenarios. Take a look at scikit-learn, an open source machine learning library written in Python. Often the hardest part of solving a machine learning problem is finding the right estimator. The flowchart on the scikit-learn official website lays out a clear path for this; on the site you can click any estimator in the chart to see its documentation.
　　
　　
Following this diagram, model selection breaks down into several steps:

1. Prepare the data and check the sample size.
Very small sample size → Collect more data; otherwise it is hard to learn a relationship that generalizes, and overfitting is likely. Alternatively, solve the problem with hand-written rules.
Sufficient sample size → Go to step 2.

2. Decide whether the problem is continuous value prediction or discrete value prediction.
Discrete value prediction → For example, "will the user buy a product or not". Classification algorithms are the usual choice. Go to step 5.
Continuous value prediction → For example, house prices or stock prices. Regression algorithms are the usual choice. Go to step 6.

3. Check the dimensionality: when it is high, reduce it first to save space and speed up processing.
4. Check whether the data is labeled.
Labeled → Classification problem. Go to step 5.
Unlabeled → Clustering problem. Clustering algorithms such as K-means and FCM can be used.

5. For classification problems, check the sample size.
Sample size not too large → Linear SVC. For text classification, naive Bayes is recommended; for non-text data, LR or SVM.
Large sample size → Training an SVM takes a long time and convergence is uncertain. Stochastic gradient descent (SGD) is recommended, possibly with kernel approximation.
6. For continuous value problems, check the sample size. Use linear regression, support vector regression, and the like at smaller scales, and SGD at larger scales.
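The decision steps above can be sketched as a small helper function. This is only an illustration: the function name is hypothetical, and the numeric limits (50 and 100,000 samples) are assumed defaults, since the notes say only "very small" and "large" without giving numbers.

```python
def suggest_estimator(n_samples, has_labels, target_is_discrete,
                      is_text=False, small_limit=50, large_limit=100_000):
    """Return a rough model suggestion following the steps in the notes.

    small_limit and large_limit are assumed thresholds, not values from the notes.
    """
    if n_samples < small_limit:
        return "collect more data or use hand-written rules"
    if not has_labels:
        return "clustering (K-means, FCM)"
    if target_is_discrete:
        if n_samples >= large_limit:
            return "SGD classifier, possibly with kernel approximation"
        if is_text:
            return "naive Bayes"
        return "linear SVC, LR, or SVM"
    if n_samples >= large_limit:
        return "SGD regressor"
    return "linear regression or support vector regression"


print(suggest_estimator(5_000, has_labels=True, target_is_discrete=True, is_text=True))
print(suggest_estimator(500_000, has_labels=True, target_is_discrete=False))
```

A helper like this is a starting point for discussion rather than a rule; the final choice still depends on trying the candidates on the actual data.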
　　
2) The second meaning: how to set the parameters
Here the model has already been chosen, but the same model has many parameters and therefore many possible fits. Take linear regression as an example:

The green line represents the true distribution of the target, and a linear regression model on polynomial terms up to $x^m$ is chosen to fit it. The fit varies greatly with the parameter m. With m=0 or m=1 the curve cannot follow the points, which is underfitting; with m=9 the curve passes through every data point but loses its predictive meaning, which is overfitting.
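The effect of m can be reproduced with a short NumPy sketch. The sine-shaped target, the noise level, and the ten training points are assumptions made for illustration; the notes do not say which function the green line represents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground truth: one period of a sine wave, observed with noise.
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.0, 1.0, 100)
y_test = np.sin(2 * np.pi * x_test)  # noise-free target used for evaluation


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


train_rmse, test_rmse = {}, {}
for m in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=m)  # least-squares polynomial of degree m
    train_rmse[m] = rmse(y_train, np.polyval(coeffs, x_train))
    test_rmse[m] = rmse(y_test, np.polyval(coeffs, x_test))
    print(f"m={m}: train RMSE={train_rmse[m]:.3f}, test RMSE={test_rmse[m]:.3f}")
```

The training error can only shrink as m grows, and with ten points a degree-9 polynomial fits them almost exactly. The error on the held-out grid is the number to watch, since it typically rises again once the model starts fitting noise.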
　　

1.4 Cross-validation

For the second meaning of model selection, how to choose parameters, one solution is cross-validation.
Usually the data is split into two parts, a training set and a test set, but this split does not help us set parameters. This is where cross-validation comes in. It divides the data into three parts, for example 70% training set, 20% cross-validation set, and 10% test set. The training set is used to build the model, the cross-validation set to select parameters and models, and the test set only to evaluate the final model. How is this done in practice? The classic approach is K-fold cross-validation.
K-fold cross-validation: The dataset A is randomly divided into k folds. Each time, one fold serves as the cross-validation set and the remaining k-1 folds as the training set; the k results are averaged (or combined in some other way) into a single estimate. With k=5, the training set is split evenly into five folds, fold1 to fold5, and five rounds are run: the first round trains on fold1 to fold4 and validates on fold5, the second trains on fold1 to fold3 plus fold5 and validates on fold4, and so on. This is repeated for each candidate parameter value (for example m in 0, 1, 3, 9 from the example above), giving a mean accuracy per parameter value that can then be compared.
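The K-fold procedure can be sketched by hand in NumPy. The example below selects the polynomial degree m by 5-fold cross-validation on assumed synthetic sine data; because this is a regression task, it compares mean validation RMSE (lower is better) instead of accuracy. The helper names are hypothetical.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cv_rmse(x, y, degree, folds):
    """Mean validation RMSE of a polynomial of the given degree over all folds."""
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i, validate on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        residual = y[val_idx] - np.polyval(coeffs, x[val_idx])
        scores.append(np.sqrt(np.mean(residual ** 2)))
    return float(np.mean(scores))


rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # assumed toy data

folds = k_fold_indices(len(x), k=5)
scores = {m: cv_rmse(x, y, m, folds) for m in (0, 1, 3, 9)}
best_m = min(scores, key=scores.get)
print(scores, "best m:", best_m)
```

The same folds are reused for every candidate m, so the comparison between parameter values is made on identical splits.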
　　

1.5 Finding the best hyperparameters

Since the choice of parameters greatly affects the model, we need to understand what each parameter of a model means and how it influences the model. Take the logistic regression model in scikit-learn, sklearn.linear_model.LogisticRegression, as an example.

Read the documentation to understand the meaning of each parameter. For example, the value of C has a large effect on convergence. The original figure shows how C affects classification accuracy; the lambda in the figure corresponds to C. When lambda=1, the boundary in the figure is smoother and the accuracy is better.

Candidate values can be searched with cross-validation using the GridSearchCV class in sklearn.grid_search (sklearn.model_selection in later scikit-learn releases).
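A minimal sketch of such a search follows, assuming a recent scikit-learn release where GridSearchCV is imported from sklearn.model_selection. The synthetic dataset and the candidate values of C are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for C; each one is scored with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the best parameters, and `search.cv_results_` keeps the per-candidate scores for closer inspection.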

2. Model Optimization

2.1 Model State

Model selection gives us a model. To judge whether it is good or bad, we need to examine the state of the model. The two most important states are overfitting (high variance) and underfitting (high bias). In engineering practice, checking the model state is the first thing to do. Section 1 included a figure illustrating the two states; how do we evaluate the model state in practice?

The first plot shows underfitting: with very little data the training accuracy is high, and it drops as the amount of data grows, but overall the accuracy never reaches the desired level.
The second plot shows overfitting: the training accuracy is higher than the desired level, but the gap between the training and validation results stays large, which means the model does not predict unseen data well.

How to handle each state:
1. Overfitting: get more training data, increase the regularization coefficient, or reduce the number of features (not recommended). Do not assume that dimensionality reduction solves overfitting.
2. Underfitting: find more features and reduce the regularization coefficient.
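Plots like the ones described above can be computed with scikit-learn's learning_curve function. The sketch below is a rough illustration: the diagnose helper, its target accuracy of 0.9, and its gap tolerance of 0.05 are hypothetical choices, not thresholds from the course.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve


def diagnose(train_score, val_score, target=0.9, max_gap=0.05):
    """Classify the model state from final training and validation accuracy.

    target and max_gap are assumed thresholds chosen for illustration.
    """
    if train_score < target:
        return "underfitting"  # even the training accuracy misses the desired level
    if train_score - val_score > max_gap:
        return "overfitting"   # training is fine but validation lags far behind
    return "ok"


X, y = make_classification(n_samples=500, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")

train_mean = train_scores.mean(axis=1)  # average over the 5 folds
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"{n} samples: train={tr:.3f}, validation={va:.3f}")

state = diagnose(train_mean[-1], val_mean[-1])
print("model state:", state)
```

Plotting train_mean and val_mean against sizes gives the learning curve; the helper only looks at the last point, so the full curve remains the better diagnostic.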

2.2 Weight Analysis

For a linear model or a linear-kernel model, the weights are the parameters $\theta = [\theta_0, \theta_1, \ldots, \theta_n]$. Analyzing the weights allows more fine-grained work: features with large weights can be refined or combined. For example, if floor area is an important feature for house prices, it can be exploited further by splitting it into living room area and bedroom area, or by adding the square of the area as a feature. Never adjust the weights by hand.
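One way to look at the weights is to fit a linear model on standardized features and rank the coefficients by magnitude. In this sketch the feature names and the synthetic data are hypothetical; the last lines refine the top feature by adding its square, as suggested in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["area", "rooms", "age"]  # hypothetical housing features
X = rng.normal(size=(200, 3))             # synthetic, already standardized
# Assumed ground truth: area matters most, rooms a little, age not at all.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Rank features by absolute weight; comparable only because features share a scale.
ranked = sorted(zip(feature_names, model.coef_),
                key=lambda item: abs(item[1]), reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:+.3f}")

# Refine the most important feature by adding its square as a new column,
# then refit instead of editing any weight by hand.
X_refined = np.column_stack([X, X[:, 0] ** 2])
refined = LinearRegression().fit(X_refined, y)
print("refined weights:", np.round(refined.coef_, 3))
```

Weights are only comparable across features when the features are on a common scale, which is why the example uses standardized inputs.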

2.3 Bad-case Analysis

Bad-case analysis is also done frequently. How?
Bad-case analysis for classification problems:
1. Which training samples were misclassified?
2. Which features led the model to that decision?
3. Do the bad cases have anything in common? For example, are they all new products, or do they share a material-related issue?
4. Are there features that have not been mined yet? Do the misclassified samples suggest new features that were not considered?
Bad-case analysis for regression problems: which samples have large prediction errors, and why?
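The first and third questions for classification can be sketched in plain Python: collect the misclassified samples, then count which attribute values they share. The record fields and the toy predictions below are hypothetical.

```python
from collections import Counter


def collect_bad_cases(records, y_true, y_pred):
    """Return the records whose prediction differs from the true label."""
    return [dict(record, true=t, pred=p)
            for record, t, p in zip(records, y_true, y_pred) if t != p]


def shared_traits(bad_cases, field):
    """Count how often each value of `field` occurs among the bad cases."""
    return Counter(case[field] for case in bad_cases).most_common()


# Hypothetical samples with one attribute that might explain the errors.
records = [
    {"id": 1, "is_new_product": True},
    {"id": 2, "is_new_product": False},
    {"id": 3, "is_new_product": True},
    {"id": 4, "is_new_product": False},
    {"id": 5, "is_new_product": True},
]
y_true = [1, 0, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1]

bad = collect_bad_cases(records, y_true, y_pred)
traits = shared_traits(bad, "is_new_product")
print([case["id"] for case in bad])
print(traits)
```

If one attribute value dominates the bad cases, as new products do in this toy data, that is a hint that a feature describing it is missing or poorly represented.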
　　

2.4 Model Fusion

Model fusion is now an important research topic in industry. For example, CTR (click-through rate) prediction combines models such as LR and GBDT. The advantage of model fusion is summed up by sayings such as "there is strength in numbers" and the "10,000-hour rule".
1. Bagging: improves the accuracy of learning algorithms.
Idea: train each model on a subset of the data rather than on the full dataset.
Classification: let the models vote.
Regression: average the models' results.
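The bagging idea can be sketched by hand with decision trees from scikit-learn: each tree is trained on a random subset drawn with replacement, and the class is decided by majority vote. The helper names, the subset ratio, and the number of models are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagging(X, y, n_models=11, subset_ratio=0.8, seed=0):
    """Train n_models trees, each on a random subset drawn with replacement."""
    rng = np.random.default_rng(seed)
    n_subset = int(subset_ratio * len(y))
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(y), size=n_subset, replace=True)
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_vote(models, X):
    """Majority vote over binary 0/1 predictions (an odd n_models avoids ties)."""
    votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)


X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
models = fit_bagging(X_train, y_train)
y_pred = predict_vote(models, X_test)
accuracy = float((y_pred == y_test).mean())
print("bagged accuracy:", round(accuracy, 3))
```

For a regression task the same loop applies with regressors, replacing the vote with the mean of the predictions. scikit-learn also ships BaggingClassifier and BaggingRegressor, which implement this pattern directly.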
　　　　　　
2. AdaBoost: solves classification problems. It down-weights training data that no longer matters and focuses on the key training samples.
Idea: train different classifiers (weak classifiers) on the same training set, then combine them into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: in each round, the weight of every sample is set according to whether that sample was classified correctly and to the overall accuracy of the previous round. The reweighted dataset is passed to the next classifier for training, and the classifiers obtained in all rounds are finally fused into the final decision classifier.
　　
D1 is the raw data, where positive and negative samples have to be separated. Start with a simple line (a weak classifier) to split the data, mark the three misclassified points, adjust the classifier, and obtain the plot for D2, where three points are still wrong. Then try splitting with another line.

Finally, the results of the three classifiers are fused to obtain the correct separation. Simple methods combined in this way give a better result.
Gradient boosting trees follow a similar idea to AdaBoost and also handle regression problems.
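The reweighting loop described for AdaBoost can be sketched as follows. This is a minimal version of the standard discrete AdaBoost update with depth-1 decision trees (stumps) as weak classifiers; the function names, the number of rounds, and the synthetic data are assumptions, and it is not necessarily the exact variant shown in the course.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Fit decision stumps on reweighted data. Labels in y must be -1 or +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start from a uniform sample distribution
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)                  # classifier weight
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(stumps, alphas, X):
    """Fuse the weak classifiers with a weighted vote."""
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)


X, y01 = make_classification(n_samples=300, n_features=5, n_informative=3,
                             n_redundant=0, random_state=0)
y = 2 * y01 - 1  # map labels {0, 1} to {-1, +1}

stumps, alphas = adaboost_fit(X, y)
pred = adaboost_predict(stumps, alphas, X)
train_accuracy = float((pred == y).mean())
print("training accuracy:", round(train_accuracy, 3))
```

Each stump is a single axis-aligned split, which plays the role of the "simple line" in the D1 and D2 example. scikit-learn provides AdaBoostClassifier for practical use.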

3. A Complete Machine Learning Case

After the theory, the instructor, Mr. Han, spent an hour walking through a real project case; take a look if you are interested. Engineering case study: April Machine Learning Algorithms Class, Workflow and Model Tuning

Model Selection Tutorial Video
Scikit-learn official website


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
