Machine Learning Algorithms Study Notes
Gochesong
@ Cedar Cedro
Microsoft MVP
This series contains study notes for Andrew Ng's machine learning course CS229 at Stanford.
Introduction to the Machine Learning Algorithms Study Notes series of articles
3 Learning Theory
3.1 Regularization and model selection
Model selection problem: for a given learning problem there are several candidate models. For example, to fit a set of sample points we could use linear regression or polynomial regression. Which model is best, i.e. which one achieves a balance between bias and variance?
There is also a class of parameter selection problems: if we want to use locally weighted regression, how should we choose the bandwidth parameter that appears in the weight formula w?
Formal definition: suppose the set of candidate models is a finite set M = {M_1, ..., M_d}. For example, if we want to do classification, then SVMs, logistic regression, neural networks, and other models can all be members of M.
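As a concrete illustration, which is not part of the original notes, the candidate set M can be written down in code. The sketch below assumes a one-dimensional regression setting with NumPy arrays; the helper name make_poly_fitter is invented for this example, and linear regression is simply the degree-1 polynomial:

```python
import numpy as np

def make_poly_fitter(degree):
    """Return a trainer: given (X, y), fit a polynomial of the given degree
    by least squares and return the resulting hypothesis h(x)."""
    def fit(X, y):
        coeffs = np.polyfit(X, y, degree)       # least-squares polynomial fit
        return lambda x: np.polyval(coeffs, x)  # hypothesis h(x)
    return fit

# The model set M = {M_1, ..., M_d}: linear regression is the degree-1 polynomial.
M = {
    "linear": make_poly_fitter(1),
    "poly-3": make_poly_fitter(3),
    "poly-9": make_poly_fitter(9),
}
```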
3.1.1 Cross Validation
Our first task is to choose the best model from M.
Let the training set be denoted by S. If we use empirical risk minimization (ERM) to measure the quality of a model, we could select a model as follows (a code sketch follows the list):
- Train each model M_i on S; once its parameters have been fit, we obtain a hypothesis h_i. (For example, for a linear model, once θ has been learned, the hypothesis h_θ(x) is determined.)
- Select the hypothesis, and hence the model, with the lowest training error.
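A minimal sketch of this (flawed) selection rule, reusing the hypothetical model set M from the sketch above and assuming squared-error loss:

```python
import numpy as np

def training_error(h, X, y):
    """Mean squared error of hypothesis h on the given data."""
    return float(np.mean((h(X) - y) ** 2))

def select_by_training_error(M, X, y):
    """ERM-style selection: pick the model whose hypothesis has the
    smallest error on the very data it was trained on."""
    hypotheses = {name: fit(X, y) for name, fit in M.items()}
    errors = {name: training_error(h, X, y) for name, h in hypotheses.items()}
    best = min(errors, key=errors.get)
    return best, hypotheses[best], errors

# On almost any data set, the most flexible model in M (here the degree-9
# polynomial) "wins", because extra flexibility can only lower the training
# error -- which is exactly why this selection rule overfits.
```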
Unfortunately, this algorithm does not work. For example, when fitting a set of sample points, a high-degree polynomial regression will certainly achieve a lower training error than linear regression: its bias is small, but its variance is very large, so it will overfit. We therefore improve the algorithm as follows:
- Randomly select 70% of the samples from all of the training data S as the training set S_train, and keep the remaining 30% as the hold-out (test) set S_cv.
- Train each model M_i on S_train to obtain a hypothesis h_i.
- Test each h_i on S_cv to obtain the corresponding empirical error.
- Select the model whose hypothesis has the smallest empirical error on S_cv as the best model.
This method is called hold-out cross validation, or simple cross validation.
Because the hold-out set and the training set are disjoint, the empirical error measured on the hold-out set can be treated as a good approximation of the generalization error. The hold-out set usually contains 1/4 to 1/3 of all the data; 30% is a typical value.
The procedure can be improved further: once the best model has been chosen, train it once more on all of the data S. Obviously, the more training data, the more accurate the model parameters.
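A minimal sketch of simple (hold-out) cross validation, under the same assumptions as the earlier sketches (the hypothetical model set M and squared-error loss); holdout_select is an invented name, not a library function:

```python
import numpy as np

def holdout_select(M, X, y, train_frac=0.7, seed=0):
    """Simple (hold-out) cross validation for model selection."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    train, cv = idx[:n_train], idx[n_train:]       # 70% train / 30% hold-out

    cv_errors = {}
    for name, fit in M.items():
        h = fit(X[train], y[train])                # train on S_train only
        cv_errors[name] = float(np.mean((h(X[cv]) - y[cv]) ** 2))

    best = min(cv_errors, key=cv_errors.get)
    final_h = M[best](X, y)                        # retrain the winner on all of S
    return best, final_h, cv_errors
```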
The weakness of simple cross validation is that the chosen model is only the best on 70% of the training data, which does not necessarily make it the best on all of the training data. Moreover, when training data are scarce, setting aside a test set leaves too little data for training.
We therefore make a further improvement to simple cross validation, as follows:
- Divide the training set S into k disjoint subsets. If S contains m training examples, each subset has m/k examples; call the subsets S_1, ..., S_k.
- For each model M_i in M, train it k times: each time leave one subset S_j out, train on the remaining k-1 subsets to obtain a hypothesis h_ij, and then test h_ij on the held-out subset S_j to obtain an empirical error.
- Since each subset is left out exactly once (j runs from 1 to k), we obtain k empirical errors for M_i; its estimated generalization error is the average of these k errors.
- Choose the model with the lowest average empirical error, then train it once more on all of S to obtain the final hypothesis.
This method is known as k-fold cross validation. Put simply, it shrinks the simple cross-validation test set to 1/k of the data; each model is trained k times and tested k times, and its error is the average over the k runs. Typically k = 10. This works reasonably well even when data are sparse; the obvious drawback is the large number of training and testing runs.
In the extreme case, k can be set to m, so that only a single example is held out for testing each time; this is called leave-one-out cross validation.
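A minimal sketch of k-fold cross validation, again under the assumptions of the earlier sketches (the hypothetical model set M, NumPy arrays, squared-error loss). Setting k = len(X) gives leave-one-out cross validation:

```python
import numpy as np

def kfold_error(fit, X, y, k=10, seed=0):
    """Average held-out error of a single model over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)                 # disjoint subsets S_1, ..., S_k
    errors = []
    for j in range(k):
        test = folds[j]                            # leave S_j out
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        h = fit(X[train], y[train])                # train on the other k-1 subsets
        errors.append(np.mean((h(X[test]) - y[test]) ** 2))
    return float(np.mean(errors))                  # average of the k errors

def kfold_select(M, X, y, k=10):
    """Pick the model with the lowest average k-fold error; retrain it on all of S."""
    errors = {name: kfold_error(fit, X, y, k) for name, fit in M.items()}
    best = min(errors, key=errors.get)
    return best, M[best](X, y), errors
```

For example, best, h, errs = kfold_select(M, X, y, k=10) returns the chosen model name, its final hypothesis trained on all of S, and the average error of each candidate.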
If we invent a new learning model or algorithm, cross validation can be used to evaluate it. In NLP, for example, it is standard practice to train on one portion of the data and test on another.
References
[1] Machine Learning open course by Andrew Ng at Stanford, http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
[2] Yu Zheng, Licia Capra, Ouri Wolfson, Hai Yang. Urban Computing: Concepts, Methodologies, and Applications. ACM Transactions on Intelligent Systems and Technology, 5(3), 2014
[3] JerryLead, http://www.cnblogs.com/jerrylead/
[4] Big Data: Massive Data Mining and Distributed Processing on the Internet. Anand Rajaraman, Jeffrey David Ullman, Wang Bin
[5] UFLDL Tutorial, http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
[6] Naive Bayes classification in Spark MLlib, http://selfup.cn/683.html
[7] MLlib - Dimensionality Reduction, http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
[8] Mathematics in Machine Learning (5): The powerful singular value decomposition (SVD) and its applications, http://www.cnblogs.com/LeftNotEasy/archive/2011/01/19/svd-and-applications.html
[9] Discussion of the linear regression implementation in MLlib, http://www.cnblogs.com/hseagle/p/3664933.html
[10] Maximum likelihood estimation, http://zh.wikipedia.org/zh-cn/%E6%9C%80%E5%A4%A7%E4%BC%BC%E7%84%B6%E4%BC%B0%E8%AE%A1
[11] Deep Learning Tutorial, http://deeplearning.net/tutorial/