Machine Learning Notes from the Dragon Star Program
Preface
In recent weeks I have spent some time studying the machine learning course from the Dragon Star program summer school (see the appendix for more information). The course covers the basic models in ML and also introduces algorithms that have become popular in recent years. In addition, it connects ML theory with real problems, for example applications in vision and on the web. In short, although the course content is not very detailed (after all, there are only a few lectures), it is relatively new and comprehensive. I learned a lot from these lessons, or at least got a clearer sense of where my own skills stand. Below are some brief notes on the course.
Lesson 1 Introduction
In machine learning there are three indispensable elements: data, models, and algorithms. Nowadays data sources are abundant, and terabytes of data can be generated every day. The models are the various models studied in a machine learning course, and the algorithms are how the parameters of a model are learned from data. However, in class Mr. Yu (Baidu's deputy technical director) argued that these three elements are not the most important thing; the most important thing is the need, because once there is a real need, all kinds of methods will be found to solve the problem. The main application scenarios of machine learning include computer vision, speech recognition, natural language processing, search, recommendation systems, self-driving cars, and question-answering systems.
Lesson 2 Linear Models
The linear regression model needs to solve the following three problems:
1. How do we estimate the parameters of the linear model (the intercept and the slope) from training data?
2. How good is the learned linear model? Can we find a better one?
3. How do we assess the significance of the two parameters in the model?
Solving the first problem is an optimization problem: find the parameters that minimize a loss function. Here the loss function is the squared error, which is known as linear least squares. The linear model can be written as y = w^T x + b + ε.
The noise term ε is Gaussian with mean 0. If the residuals obtained after fitting do not look like a zero-mean, constant-variance Gaussian random variable, the model can be improved: for example, first map x through a non-linear function and then perform linear least squares regression on the transformed features. As for how to obtain the non-linear mapping f(x), it is either designed by human observation or obtained automatically through feature learning in machine learning.
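As a concrete illustration (my own sketch, not from the course materials), here is a minimal NumPy example that fits a line by least squares and then improves the fit by first mapping x through a simple non-linear basis:

```python
import numpy as np

# Toy data: y depends non-linearly on x, plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 - x + 1 + rng.normal(0, 0.3, size=100)

# Plain linear least squares: columns are [1, x] (intercept and slope).
X_lin = np.column_stack([np.ones_like(x), x])
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# Map x through a non-linear basis first, then run the same least squares.
X_poly = np.column_stack([np.ones_like(x), x, x**2])
w_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

print("linear fit residual:", np.mean((X_lin @ w_lin - y) ** 2))
print("with x**2 feature:  ", np.mean((X_poly @ w_poly - y) ** 2))
```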
The more complex the model, the more unstable it is and the more likely it is to overfit; the smoother the learned target function, the better it generalizes. Therefore the complexity of the model needs to be controlled. There are generally two ways: reduce the number of parameters in the model, or shrink the space in which the parameters live. Currently the most common approach is the latter, achieved through a regularization term. Introducing a regularization term also introduces a tuning parameter, whose size is usually chosen by cross-validation. If the regularization term is the squared (L2) norm, the method is called ridge regression; if it is the L1 norm, it is called lasso regression. Ridge regression has the advantage of a stable solution and allows the number of parameters to be greater than the number of samples. Lasso regression has the advantage of a sparse solution, but the solution is not necessarily stable.
When the number of parameters is greater than the number of samples, the parameters cannot be estimated by ordinary least squares alone; shrinking the parameter space through regularization makes the estimate statistically robust to large feature sets while keeping the numerical solution of the equations stable.
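A quick sketch (mine, using scikit-learn, which the course itself does not mention) of how ridge and lasso behave differently, with the regularization strength chosen by cross-validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# 100 samples, 50 features, but only 3 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(0, 0.1, size=100)

# The regularization strength is picked by cross-validation in both cases.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

print("ridge: non-zero coefficients =", np.sum(ridge.coef_ != 0))  # dense
print("lasso: non-zero coefficients =", np.sum(lasso.coef_ != 0))  # sparse
```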
Lesson 3 Linear Classifiers
A good understanding of linear classifiers helps in understanding many ML concepts as well as non-linear problems. Linear classifiers are the most useful models in practical applications.
According to instructor Yu, artificial neural networks have been heating up again in recent years, mainly in the field of deep learning.
SVM theory is well developed, and SVMs are widely used. Logistic regression is similarly widely used and behaves much like an SVM.
When the sample size is very large, a linear SVM model is the better choice, as in the sketch below.
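A hedged sketch (my own, via scikit-learn) of a linear SVM trained by stochastic gradient descent, which is the kind of solver that scales to large sample sizes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class problem standing in for "big sample" data.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# loss="hinge" gives a linear SVM trained by SGD; loss="log_loss"
# would give logistic regression with the same training loop.
clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0).fit(X_tr, y_tr)
print("linear SVM accuracy:", clf.score(X_te, y_te))
```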
Lesson 4 Nonlinear SVM
The RKHS representer theorem: the model parameter vector is a linear combination of the training samples, i.e. it lies in the linear subspace spanned by the training samples. This applies not only to SVMs but also to other models, such as perceptrons, RBF networks, LVQ, boosting, and logistic regression.
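Stated a bit more formally (my own paraphrase, not the course's exact wording): for a regularized empirical-risk objective over an RKHS with kernel K and feature map φ, the minimizer has the finite expansion

```latex
f^{*}(x) = \sum_{i=1}^{N} \alpha_i \, K(x_i, x),
\qquad \text{equivalently} \qquad
w^{*} = \sum_{i=1}^{N} \alpha_i \, \phi(x_i)
```

so the solution is described by one coefficient per training sample, regardless of the dimension of the feature space.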
A kernel is a measure of the similarity between two samples. Kernel functions also give a better way to understand regularization. The objective to be optimized can be written in parametric form, dual form, or non-parametric form. If the regularization term is written in non-parametric form directly on the learned function f(x), the penalty on each eigenfunction in the eigen-decomposition of the kernel is inversely proportional to the corresponding eigenvalue: eigenfunctions with small eigenvalues (the non-principal components) are penalized heavily and thus suppressed, so the eigenfunctions of the principal components are retained. From this we can see that the kernel function imposes a structure that determines what the final target function f(x) will look like.
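A small sketch (my own, assuming an RBF kernel as the choice of similarity) showing a kernel as a pairwise similarity measure and the resulting non-parametric prediction f(x) = Σᵢ αᵢ K(xᵢ, x):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2): similarity of two samples."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=40)

# Kernel ridge regression: alpha = (K + lam*I)^{-1} y, so the learned f
# is a weighted combination of kernel similarities to training points.
K = rbf_kernel(X, X)
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_new = np.array([[0.5]])
f_new = rbf_kernel(X_new, X) @ alpha   # f(x) = sum_i alpha_i K(x, x_i)
print("prediction at x=0.5:", f_new[0], "vs sin(0.5) =", np.sin(0.5))
```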
The difference between logistic regression and the SVM lies in the loss function: logistic regression uses the logistic loss, while the SVM uses the hinge loss. The two perform about the same. Logistic regression produces probabilistic outputs, which makes it easier to use for multiclass classification problems. But by now both methods are rather old.
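To make the difference concrete, a small illustration of mine computing the two losses as functions of the margin m = y·f(x):

```python
import numpy as np

# Both losses depend only on the margin m = y * f(x), with y in {-1, +1}.
m = np.linspace(-2, 3, 6)

logistic_loss = np.log(1 + np.exp(-m))   # smooth, never exactly zero
hinge_loss = np.maximum(0, 1 - m)        # exactly zero once the margin exceeds 1

for mi, ll, hl in zip(m, logistic_loss, hinge_loss):
    print(f"margin {mi:5.1f}: logistic {ll:.3f}, hinge {hl:.3f}")
```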
LVQ (learning vector quantization) is a prototype-based supervised classifier.
So when designing a model, what loss function should we consider? What kind of basis functions h(x) should be used, and is the set of h(x) finite or infinite? Does h(x) itself need to be learned? And what method should be used to optimize the target function, e.g. QP, L-BFGS, or gradient descent?
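As one concrete choice among the optimizers listed, here is a minimal gradient-descent sketch (mine, not the course's) for L2-regularized logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.zeros(2)
lr, lam = 0.1, 0.01
for _ in range(500):
    m = y * (X @ w)                       # margins y * f(x)
    # Gradient of mean logistic loss log(1 + exp(-m)) plus L2 penalty.
    grad = -(X * (y * (1 / (1 + np.exp(m))))[:, None]).mean(0) + lam * w
    w -= lr * grad

print("learned weights:", w)  # roughly proportional to [1, 2]
```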
In theory, kernel methods make it possible to learn in an infinite-dimensional space with finite computation. In practice, however, solving the problem costs on the order of the 3rd power of the number of samples N, so when there is a lot of sample data it is basically infeasible.
The difference between a parametric model and a non-parametric model is not whether the model has parameters; all models have parameters. A non-parametric model is one in which the effective number of parameters grows as the number of samples grows; otherwise it is a parametric model. Common non-parametric models include Gaussian processes, kernel SVMs, and Dirichlet processes.
Lesson 6 Model Selection
Model selection is very useful in practical applications. Generally, the available data is divided into three parts: training data, validation data, and test data. The model is fit on the training data, the validation data is used to compare models and choose tuning parameters, and the test data is used only for the final estimate of performance.
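A hedged sketch of such a three-way split (my own example, using a hypothetical 60/20/20 ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(1000).reshape(-1, 1), np.arange(1000) % 2

# First carve off 40%, then split that half into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200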
The course then turned to feature learning: letting the system automatically learn the features of the samples rather than relying on human design. How attractive that sounds! It is much closer to AI. Deep learning is mainly about learning a hierarchy of features, and this hierarchy is very important. The idea resembles the working mechanism of the human cerebral cortex, since the human brain also processes what it perceives through a hierarchical structure. The slides mainly cover multi-scale models, hierarchical models, structure spectrum, and so on, but do not go into detail; it is just an overview.
Lesson 16 Transfer Learning & Semi-supervised Learning
On the one hand, for some problems the training sample data is very small, samples are expensive to acquire, or model training takes a very long time; on the other hand, many problems are similar to one another. This is what gave rise to TL (transfer learning). TL puts multiple similar tasks together so that they share the same input space and output space. Common examples of TL include sensor-network prediction, recommendation systems, and image classification. Common TL models include HLM (hierarchical linear models), neural networks, and linear regression models; in essence these models learn a shared hidden feature space. In addition, the instructor compared TL with GP (Gaussian processes). A Gaussian process is a non-linear Bayesian kernel machine that obtains a posterior probability model by learning from prior samples; it is a non-parametric model. TL methods fall into four main categories: transfer between samples, transfer of feature representations, transfer of models, and transfer of knowledge between related domains. Transfer of feature representations and transfer of models are mathematically similar in essence, and they are also the focus of researchers.
SSL (semi-supervised learning) aims to learn a better model from a small number of labeled samples together with a large number of unlabeled samples than could be learned from the labeled samples alone. The instructor used a Gaussian mixture distribution example to explain how SSL helps, and introduced a general generative model of SSL through it. The course also briefly introduced the co-training method: split the features of the data into several views, train one model per view, then apply these models to the unlabeled samples and use an optimization procedure to make their outputs agree. The Graph Laplacian and its harmonic solution were also touched on.
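A hedged sketch of the co-training loop described above (my own simplification: each round, the most confident pseudo-label from one view is handed to the other view):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
labeled = rng.permutation(600)[:40]               # only 40 labeled samples
unlabeled = np.setdiff1d(np.arange(600), labeled)

view1, view2 = slice(0, 10), slice(10, 20)        # split the features into two views
idx1, idx2 = labeled.copy(), labeled.copy()       # training sets for the two models
lab1, lab2 = y[labeled].copy(), y[labeled].copy()

for _ in range(5):
    m1 = LogisticRegression(max_iter=1000).fit(X[idx1, view1], lab1)
    m2 = LogisticRegression(max_iter=1000).fit(X[idx2, view2], lab2)
    # Each model pseudo-labels the unlabeled pool; the most confident
    # prediction of one view becomes a new training label for the other.
    conf1 = m1.predict_proba(X[unlabeled, view1]).max(axis=1)
    conf2 = m2.predict_proba(X[unlabeled, view2]).max(axis=1)
    pick1, pick2 = unlabeled[conf1.argmax()], unlabeled[conf2.argmax()]
    idx2 = np.append(idx2, pick1)
    lab2 = np.append(lab2, m1.predict(X[[pick1], view1])[0])
    idx1 = np.append(idx1, pick2)
    lab1 = np.append(lab1, m2.predict(X[[pick2], view2])[0])

print("view-1 model accuracy on all data:", m1.score(X[:, view1], y))
```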
Lesson 17 Recommendation Systems
A simple application of recommendation systems is to work out, from a user's purchase history, which products the user may like, and recommend them. At present many Internet companies are doing research in this area, because it can bring a lot of economic benefit. Recommendation is a collaborative filtering problem; the running example in this course is predicting how different users rate different movies. The first problem to solve is that the historical data carries different biases, i.e. the data must be preprocessed and normalized (for example, by removing per-user and per-movie rating means).
One mainstream way to design a recommendation system is to treat it as a classification problem: for each user i, treat that user's ratings of all movies as the labels to predict, and all other users' ratings of those movies as the features, mainly using naive Bayes and kNN (most other classification algorithms can also be used). Another mainstream approach is to treat it as a matrix factorization (MF) problem, which works best in practice. Because the observed data is sparse, many entries are missing; and because there is a simple structure underlying the data, the rating matrix R to be completed can be factorized into the product of two low-rank matrices, which can be solved with SVD or with SVD plus some optimization methods.
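A minimal sketch of the low-rank idea (mine, using plain truncated SVD with mean-filled missing entries; real systems optimize only over the observed entries):

```python
import numpy as np

# Tiny user x movie rating matrix; 0 marks a missing rating.
R = np.array([[5, 4, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)
mask = R > 0

# Normalize: fill missing entries with the mean of the observed ratings.
mean = R[mask].mean()
R_filled = np.where(mask, R, mean)

# Rank-2 truncated SVD: approximate R as a product of low-rank factors.
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("predicted rating of user 0 for movie 2:", round(R_hat[0, 2], 2))
```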
All in all, recommendation systems are a typical ML problem.
Lesson 18 Computer Vision
This lesson briefly introduces the basics of computer vision: what computer vision is and why it is difficult, and the classification of computer vision problems, such as feature detection, edge detection, object detection, image segmentation, puzzles, 3D reconstruction, computer graphics, and object recognition.
Lesson 19 Learning on the Web
Machine learning is widely used on the web, for example in the recommendation systems mentioned earlier, and also in ranking search results, classification problems, community behavior analysis, and user behavior models. This course mainly introduces classification and ranking. There is all kinds of spam on the network, such as spam email, spam web pages, and spam ads, and one classification problem is to use ML methods to filter this spam out. Another common classification problem is text classification, which identifies the topic a text describes; a BOW (bag-of-words) representation is simple and achieves good results. Finally, the instructor gave a brief introduction to the web-search problem. In short, this course introduces the simple applications and challenges of ML on the web.
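A hedged sketch of BOW text classification (my own toy example with made-up documents, not from the course):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical spam/ham examples.
docs = ["win free money now", "cheap pills free offer",
        "meeting notes for tomorrow", "lunch with the project team"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam

# BOW: each document becomes a vector of word counts.
vec = CountVectorizer()
X = vec.fit_transform(docs)

clf = MultinomialNB().fit(X, labels)
test = vec.transform(["free offer for the team"])
print("P(spam):", clf.predict_proba(test)[0, 1])
```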
Summary:
My overall feeling about ML: regularization and optimization run through every chapter of this course.