One, methods to improve the accuracy of machine learning algorithms
When a machine learning algorithm does not predict the test data accurately, we can try to improve its accuracy with the following methods:
1) get more training examples
2) reduce the number of features
3) increase the number of features
4) add polynomial features
5) increase or decrease the regularization parameter \(\lambda\)
Two, evaluating the machine learning model
If we evaluate the algorithm using the training set alone, we cannot judge its accuracy well, because a low training error may simply mean that the model is overfitting. Instead, we can divide the data set into two parts:
Take 70% as the training set and 30% as the test set
1) use the training set to learn the \(\theta\) that minimizes \(J(\theta)\)
2) use the test set to evaluate the accuracy of the algorithm
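The 70/30 split described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library; the function name `train_test_split`, its parameters, and the toy data are hypothetical and not part of the course material. The examples are shuffled first, on the assumption that the original order of the data set may not be random.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test).

    Hypothetical helper: `train_fraction` is the share of examples that
    goes to the training set; `seed` makes the shuffle reproducible.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy data set of (x, y) pairs, made up for illustration only.
data = [(x, 2.0 * x + 1.0) for x in range(10)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

The model would then be fitted on `train_set` only, and `test_set` would be held back for the evaluation step.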
Methods for evaluating the accuracy of algorithms
1) linear regression: \(J_{test}(\theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_\theta(x^{(i)}_{test}) - y^{(i)}_{test})^2\)
2) logistic regression: \(err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \geq 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}\)
\(\text{Test Error} = \dfrac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\theta(x^{(i)}_{test}), y^{(i)}_{test})\)
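As a rough sketch, the two evaluation formulas above could be computed as follows. The function names are hypothetical, and the inputs are assumed to be plain Python lists: the hypothesis outputs \(h_\theta(x)\) on the test set, and the matching test labels.

```python
def linear_regression_test_cost(predictions, targets):
    """J_test(theta): sum of squared errors divided by 2 * m_test."""
    m_test = len(targets)
    squared = sum((p - y) ** 2 for p, y in zip(predictions, targets))
    return squared / (2 * m_test)

def misclassification_error(probabilities, labels, threshold=0.5):
    """Test error for logistic regression: the fraction of test examples
    where the thresholded prediction disagrees with the 0/1 label."""
    m_test = len(labels)
    wrong = sum(
        1 for p, y in zip(probabilities, labels)
        if (p >= threshold) != (y == 1)
    )
    return wrong / m_test

# Made-up numbers for illustration only.
j_test = linear_regression_test_cost([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
test_error = misclassification_error([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(j_test, test_error)
```

In the second call, the examples with outputs 0.4 and 0.6 are misclassified, so half of the four test examples count as errors.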
Three, choosing a machine learning model
If there is more than one candidate model to choose from, we can divide the data set into three parts: 60% training set, 20% cross-validation set, and 20% test set
1) for each model, use the training set to learn the \(\theta\) that minimizes \(J(\theta)\)
2) select the model with the smallest error on the cross-validation set
3) use the test set to estimate the generalization error of the model selected in step 2, and check whether it meets our requirements
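The three steps above can be sketched as follows, assuming the candidate models are polynomials of different degrees fitted by least squares. Everything here is illustrative: the synthetic data, the helper names (`fit_polynomial`, `predict`, `cost`, `split_xy`), and the use of the normal equations instead of gradient descent are assumptions, not something the notes prescribe.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations (X^T X) theta = X^T y,
    solved by Gaussian elimination with partial pivoting."""
    n = degree + 1
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    theta = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(a[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (b[r] - tail) / a[r][r]
    return theta

def predict(theta, x):
    return sum(t * x ** k for k, t in enumerate(theta))

def cost(theta, xs, ys):
    """Squared-error cost J(theta) on the given examples."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def split_xy(rows):
    return [x for x, _ in rows], [y for _, y in rows]

# Synthetic quadratic data with noise, made up for illustration only.
rng = random.Random(0)
points = [(x, 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.gauss(0.0, 0.1))
          for x in (-1.0 + 2.0 * i / 49 for i in range(50))]
rng.shuffle(points)
train, cv, test = points[:30], points[30:40], points[40:]  # 60% / 20% / 20%
x_train, y_train = split_xy(train)
x_cv, y_cv = split_xy(cv)
x_test, y_test = split_xy(test)

degrees = [1, 2, 3, 4, 5]
# Step 1: fit every candidate model on the training set.
thetas = {d: fit_polynomial(x_train, y_train, d) for d in degrees}
# Step 2: pick the model with the smallest cross-validation error.
cv_costs = {d: cost(thetas[d], x_cv, y_cv) for d in degrees}
best_degree = min(degrees, key=lambda d: cv_costs[d])
# Step 3: estimate the generalization error on the untouched test set.
test_cost = cost(thetas[best_degree], x_test, y_test)
print(best_degree, cv_costs[best_degree], test_cost)
```

The test set plays no part in choosing the degree, so `test_cost` is not biased by the selection step the way the cross-validation error is.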
Four, bias (underfitting) and variance (overfitting)
When a model does not meet our requirements, how can we improve its accuracy? There are many methods, but we cannot simply try them all in turn. Each method addresses either high variance or high bias, so we should first determine whether the model suffers from high bias or from high variance
In linear regression, as we increase the maximum degree d of the polynomial features of x in the hypothesis function, the bias and variance change: a small d tends to give high bias and a large d tends to give high variance. The model has high bias when \(J_{train}(\theta) \approx J_{cv}(\theta)\) and both errors are large, and high variance when \(J_{cv}(\theta)\) is much larger than \(J_{train}(\theta)\)
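The rule of thumb above could be turned into a small check like the one below. This is only a heuristic sketch: the function name `diagnose`, the `target_error` threshold, and the `gap_ratio` used to decide what counts as "much larger" are all hypothetical choices that would need tuning for a real problem.

```python
def diagnose(j_train, j_cv, target_error, gap_ratio=2.0):
    """Classify a model from its training and cross-validation errors.

    Assumptions (not from the course): the training error is "large" when
    it exceeds `target_error`, and J_cv is "much larger" than J_train when
    it exceeds `gap_ratio * j_train`.
    """
    if j_cv > gap_ratio * j_train:
        return "high variance"   # J_cv much larger than J_train
    if j_train > target_error:
        return "high bias"       # both errors large and close together
    return "ok"

# Made-up error values for illustration only.
print(diagnose(j_train=0.40, j_cv=0.45, target_error=0.05))
print(diagnose(j_train=0.01, j_cv=0.30, target_error=0.05))
```

The first call corresponds to the high-bias case (both errors large and similar) and the second to the high-variance case (low training error, much higher cross-validation error).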
Machine learning open course notes, week five: optimizing machine learning algorithms