Machine Learning Notes from the Dragon Star Program

Source: Internet
Author: User
Tags: svm

 

  Preface

In recent weeks, I spent some time going through the machine learning course from this summer's Dragon Star program; for more information, see the references at the end. The course covers the basic models in ML and also introduces popular new algorithms from recent years. In addition, it connects ML theory with real problems, for example applications in vision and on the Web. In short, although the course content is not very detailed (after all, there are only a few classes), it is fairly new and comprehensive. I learned a lot from these lessons, or at least got a clearer picture of where my own skills stand. Below are some simple notes on the course.

 

  Lesson 1 Introduction

Machine learning has three indispensable elements: data, models, and algorithms. Data sources are now everywhere, and terabytes of data can be generated every day. The models are the various models studied in a machine learning course, and the algorithms are the procedures for learning the parameters of a model from the data. In class, however, Mr. Yu argued that these three elements are not the most important thing; what matters most is the need: once there is a real need, all kinds of methods will be found to solve the problem. (He is Baidu's deputy technical director.) In addition, the main application scenarios of machine learning include computer vision, speech recognition, natural language processing, search, recommendation systems, self-driving cars, and question answering systems.

 

  Lesson 2 Linear Model

The linear regression model needs to solve the following three problems:

1. How do we estimate the linear model parameters (the intercept and slope) from the training data?

2. What is the performance of the learned linear model? Can we find a better model?

3. How do we assess the importance of the parameters in the model?

Solving the first problem is an optimization problem: find the parameters that minimize a loss function. Here the loss function is the sum of squared errors, which gives ordinary (linear) least squares. The linear model can be written as:

  y = β0 + β1·x + ε,  where the noise ε ~ N(0, σ²)

The noise term is Gaussian with mean 0. If the residuals obtained after fitting do not behave like zero-mean Gaussian noise with constant variance, the model can be improved: for example, first map x through a nonlinear function and then apply least squares to the transformed features. The nonlinear mapping f(x) can be obtained either by human observation (hand-crafted features) or automatically through feature learning in machine learning.
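
A minimal sketch (my own, not from the course) of this idea: fit a nonlinear function with plain linear least squares by first mapping x through hand-chosen polynomial basis functions. The data is made up.

```python
import numpy as np

# Made-up nonlinear data.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

# Nonlinear mapping f(x): here a polynomial basis chosen by hand.
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the transformed features.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
print("fitted coefficients:", w)
print("RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```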

A generalized linear model is not necessarily linear in its inputs, but it is linear in its parameters; in this way linear models can approximate nonlinear functions.

Residuals can be viewed as an approximation of the noise, though in general the residuals are smaller than the noise. In a linear model, the residuals can therefore be used to estimate the noise variance; however, to obtain an unbiased estimate, the denominator is 1/(n - p) rather than 1/n, where p is the number of parameters.

There are two common methods for evaluating the importance of individual feature attributes. The first takes one feature and computes the ratio of the residual variation it explains relative to all features; the resulting score is the F-score, and the larger the F-score, the more important the attribute. The second uses a t-test to obtain a Z-score for each coefficient: it asks how probable the observed data would be if the corresponding feature had no effect (coefficient equal to 0); the larger the |Z|-score, the less plausible that hypothesis and hence the more important the attribute.
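
A small sketch (my own illustration, with made-up data) of computing Z-scores (t-statistics) for regression coefficients, using the unbiased 1/(n - p) noise-variance estimate mentioned above:

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve ordinary least squares.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Unbiased noise-variance estimate uses 1/(n - p), p = number of parameters.
residuals = y - Xd @ beta
sigma2 = residuals @ residuals / (n - Xd.shape[1])

# Z-scores (t-statistics): each coefficient divided by its standard error.
cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
z_scores = beta / np.sqrt(np.diag(cov))
print(z_scores)  # a large |z| means the coefficient is unlikely to be zero
```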

 

  Lesson 3 Regularization

Regularization balances overfitting and underfitting: it controls the complexity of the model by limiting the parameter space. The gap between the test error and the training error is governed by such a complexity (regularization) term, and the regularized training objective takes the following form:

  min_w  Σ_i L(y_i, f(x_i; w)) + λ·R(w)

The more complex the model, the less stable it is; the smoother the learned target function, the less prone it is to overfitting. We therefore need to control the complexity of the model. There are generally two ways: reduce the number of parameters, or shrink the size of the parameter space. The most commonly used approach today is to shrink the parameter space, which is achieved through a regularization term. Introducing a regularization term also introduces a tuning parameter (the regularization strength), whose value is usually chosen by cross-validation. If the regularization term is quadratic, the method is called ridge regression; if it is first-order (absolute values), it is called the lasso. Ridge regression has the advantage of a stable solution and allows the number of parameters to exceed the number of samples; the lasso has the advantage of a sparse solution, but its solution is not necessarily stable.

When the number of parameters exceeds the number of samples, we do not reduce the number of parameters; instead we shrink the parameter space. This makes the estimate statistically robust for large feature sets while keeping the numerical solution of the equations stable.
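
A small ridge-versus-lasso sketch with scikit-learn (my own illustration; the data and the regularization strengths alpha are made up, and in practice alpha would be chosen by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical setting: more features (50) than samples (30), only 5 relevant.
rng = np.random.default_rng(1)
n, d = 30, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, 2.0]
y = X @ true_w + rng.normal(scale=0.1, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: stable, dense coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sparse coefficients

print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("nonzero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))
```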

 

  Lesson 4 Linear Classifier

A good understanding of linear classifiers helps in understanding many ML concepts as well as nonlinear problems. The linear classifier is also the most useful model in practical applications.

According to instructor Yu, artificial neural networks have been heating up again in recent years, mainly in the field of deep learning.

SVM theory is well developed and the method is widely used; logistic regression is similarly widely used and performs comparably to SVM.

When the dataset contains a very large number of samples, a linear SVM model is the better choice.
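
A rough example (my own, not from the course) of training a linear SVM on synthetic data with scikit-learn; the sample size and the parameter C are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for a "large sample" problem.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A linear SVM trains quickly even with many samples; C is the (inverse)
# regularization strength and would normally be tuned by cross-validation.
clf = LinearSVC(C=1.0, dual=False).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```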

 

  Lesson 5 Nonlinear SVM

RKHS representer theorem: the model parameters are a linear combination of the training samples, i.e., the solution lies in the linear subspace spanned by the training samples. This applies not only to SVM but also to other models, such as the perceptron, RBF networks, LVQ, boosting, and logistic regression.

A kernel is a measure of the similarity between two samples. Kernel functions also give a better way to understand regularization: the objective to be optimized can be written in parametric, dual-parameter, or non-parametric form. If the regularization term is expressed non-parametrically in terms of the learned function f(x), the penalty it imposes is inversely proportional to the eigenvalues in the eigenfunction decomposition of the corresponding kernel. That is, eigenfunctions with small eigenvalues (the non-principal components) are penalized heavily and therefore suppressed, so the principal-component eigenfunctions are retained. In other words, the kernel has a certain structure, and that structure determines what the final target function f(x) can look like.

The difference between logistic regression and SVM lies in the loss function: logistic regression uses the logistic (log) loss, while the kernel SVM uses the hinge loss. The two perform similarly. Logistic regression produces probabilistic outputs, which makes it easier to use for multi-class classification problems. Both methods are by now classical.
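
To make the loss-function difference concrete, here is a tiny sketch (my own) comparing the hinge loss and the logistic loss as functions of the margin y·f(x):

```python
import numpy as np

# Both losses are functions of the margin m = y * f(x), with y in {-1, +1}.
def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)          # SVM

def logistic_loss(margin):
    return np.log(1.0 + np.exp(-margin))          # logistic regression

margins = np.array([-2.0, 0.0, 1.0, 3.0])
print(hinge_loss(margins))     # exactly zero once the margin exceeds 1
print(logistic_loss(margins))  # always positive but decays exponentially
```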

LVQ (learning vector quantization) is a prototype-based supervised classifier.

So what loss function should we consider when designing a model? What kind of basis functions h(x) should be used? Is the set of h(x) finite or infinite? Does h(x) itself need to be learned? And what method should be used to optimize the objective function, e.g., QP, L-BFGS, or gradient descent?

In theory, the kernel trick allows learning in an infinite-dimensional space with finite computation. In practice, however, the cost grows roughly as the third power of the number of samples n, so with very large datasets it is basically infeasible.

The difference between a parametric model and a non-parametric model is not whether the model has parameters; all models have parameters. A non-parametric model is one in which the number of parameters grows with the number of samples; otherwise the model is parametric. Common non-parametric models include Gaussian processes, kernel SVMs, and Dirichlet processes.

 

  Lesson 6 Model Selection

Model selection is very useful in practice. The data related to a model is generally divided into three parts: training data, validation data, and test data, as shown below:

  [Figure: split of the data into training, validation, and test sets]

Both the training data and the validation data are existing sample data, i.e., observed data. The test data is the data that will appear in future real applications and is unknown in advance.

Model parameters fall into two groups. One group is learned from the training samples once the model form is fixed. The other group consists of manually set parameters, known as hyperparameters, which control the complexity of the model (i.e., the model itself) and are tuned using the validation data.

The core question of model selection is how to check whether a model is good. The final performance of a model is its performance on the test set, so before any test data is observed we can only use the validation set as a stand-in. Cross-validation is commonly used, e.g., LOOCV (leave-one-out cross-validation) and the similar K-fold cross-validation. The main purpose of cross-validation is to prevent the trained model from overfitting. Nowadays, with massive datasets, cross-validation is used less and less, because with a very large training set there is generally no overfitting.
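
A small scikit-learn sketch (my own illustration) of K-fold cross-validation and LOOCV on synthetic data; the model and data sizes are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the validation set.
print("5-fold:", cross_val_score(model, X, y, cv=5).mean())

# LOOCV: n folds of size 1; accurate but expensive for large n.
print("LOOCV:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```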

There are also criteria for evaluating model quality directly, without a validation set, such as AIC, BIC, MDL, and SRM.

 

  Lesson 7 Model Averaging

The "model" discussed here is something even narrower in scope than a learning algorithm, because within a single learning algorithm, different parameter settings and different input features produce different models. The goal of model selection is a model that is both interpretable and performs well, whereas model averaging only aims at good performance: since many models are combined, the more models there are, the lower the interpretability. Its English names include model ensemble, model blending, model combination, and model averaging.

The difference between model selection and model combination shows up as follows: if one model has an absolute advantage over all the others, we use model selection, because it gives both good performance and good interpretability. If all models perform similarly, with none clearly better or worse, but the models themselves differ substantially, model combination can greatly improve performance. In general, model combination is more stable than model selection.

So how do we construct models that differ substantially from each other? We can start from the following four aspects:

1. Different learning algorithms.

2. Adjust different parameters.

3. Different Input features.

4. Introduce randomness, for example bagging.

Exponentially weighted model averaging differs from uniform model averaging (i.e., plain voting) only in that each model's voting weight is an exponential function of its error rather than being equal: the larger a model's error, the lower its weight. This is more appealing in theory, but when Mr. Zhang discussed his experiments, he found no improvement, and the result was sometimes worse than plain voting.
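
A tiny numerical sketch (my own, with made-up error values and a hypothetical temperature parameter eta) of uniform voting weights versus exponential weights:

```python
import numpy as np

# Hypothetical validation errors of four models (made-up numbers).
errors = np.array([0.10, 0.12, 0.25, 0.40])

# Uniform averaging: every model votes equally.
uniform_w = np.ones_like(errors) / len(errors)

# Exponential weighting: weight ~ exp(-eta * error); eta controls how sharply
# low-error models dominate (eta is a tuning parameter, not from the course).
eta = 10.0
exp_w = np.exp(-eta * errors)
exp_w /= exp_w.sum()

print("uniform:    ", uniform_w)
print("exponential:", exp_w)
```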

Stacking is similar to exponentially weighted model averaging, except that after each base model is learned, the learned models' outputs are used as the input to a second layer of learning, and the model weights are obtained by minimizing the second-layer error.

Bagging is also a uniform model average. It uses the same learning algorithm for every model, but each model's training set is drawn with the bootstrap. Because of the bootstrap sampling, some training samples are not used by a given model while others are used repeatedly, so each learned model is somewhat unstable; this enlarges the differences between models and improves the ensemble's performance. Bagging reduces the variance of learning, while boosting reduces its bias.
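
A short bagging sketch with scikit-learn (my own illustration; the data and the number of estimators are arbitrary). Each tree is trained on a bootstrap resample, and averaging their votes reduces the variance of the unstable base learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
# 25 trees, each fit on a bootstrap resample of the training data.
bag = BaggingClassifier(n_estimators=25, random_state=0)

print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bag, X, y, cv=5).mean())
```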

Finally, a well-known application of model averaging is turning a decision tree into a random forest. A single decision tree is interpretable, handles heterogeneous features well, and is a nonlinear method, but its biggest drawback is that its classification results are not very accurate. By randomly selecting samples and input features to obtain many different trees and then averaging them, we obtain a random forest; theory and experiments show that the random forest performs much better than a single decision tree.

 

  Lesson 8 Boosting

Boosting can be viewed either as a single learning algorithm or as ensemble learning; in this lesson it is treated as ensemble learning. It combines multiple weak classifiers into a strong classifier, although the "weak" classifiers described here are not actually weak, since their accuracy on the weighted samples must be greater than 0.5; for this reason many scholars call them base classifiers rather than weak classifiers. The best-known boosting algorithm is AdaBoost, which increases the weights of misclassified samples to achieve re-weighting, and uses a greedy algorithm to optimize its loss function.
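
A minimal AdaBoost-style sketch (my own, not the course's code) using decision stumps, to show how the weights of misclassified samples are increased at every round:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)  # AdaBoost convention: labels in {-1, +1}

n = len(y)
w = np.ones(n) / n           # start with uniform sample weights
stumps, alphas = [], []

for t in range(20):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    err = np.sum(w * (pred != y))                       # weighted error (< 0.5)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # classifier weight

    # Increase the weights of misclassified samples, decrease the others.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final strong classifier: weighted vote of the stumps.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(score) == y))
```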

The traditional definition of the VC dimension: for a set of indicator functions, if h samples can be separated by functions in the set in all 2^h possible ways, the function set is said to shatter those h samples. The VC dimension of the function set is the maximum number of samples h that it can shatter.

AdaBoost does not maximize the margin, so why does it work better than maximum-margin boosting? The course gives some explanations from traditional boosting analysis, but they still cannot explain why the generalization error keeps decreasing after the training error has reached 0; later scholars introduced the margin bound to address this. Another perspective for understanding boosting is greedy boosting: the search for the sample weights D and the weak-classifier weights w is a greedy process. Finally, the teacher presented a general loss function and the general boosting algorithm built on it.

 

  Lesson 9 Introduction to Learning Theory

The content of this lesson is more theoretical and harder to follow. The main goal of learning theory is to measure the quality of a learning algorithm, that is, to estimate the test error from the training error. Uniform convergence can be used to relate the two: with high probability, the test error is no more than the training error plus a quantity whose size depends on the number of training samples and on the confidence level. Proving such uniform convergence requires several tools, such as the Chernoff/Hoeffding inequality, the VC dimension, and covering numbers. The covering number is defined as the number of prediction functions needed to approximate the function class on the training sample (I did not fully understand this). The VC dimension can be used to bound the covering number. At the end, the teacher also discussed Rademacher complexity and its relationship with the VC dimension; I really don't know what Rademacher complexity is!
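
As a concrete (if simplified) illustration of this kind of bound, here is the standard Hoeffding-plus-union-bound gap for a finite hypothesis class, which is not exactly what the course derived (it used VC dimension and covering numbers); the class size and sample sizes below are made up:

```python
import numpy as np

# With probability at least 1 - delta, for every hypothesis in a class of size H:
#     test_error <= train_error + sqrt(ln(2H / delta) / (2n))
def uniform_convergence_gap(n_samples, n_hypotheses, delta=0.05):
    return np.sqrt(np.log(2 * n_hypotheses / delta) / (2 * n_samples))

for n in (100, 1000, 10000):
    print(n, uniform_convergence_gap(n, n_hypotheses=1_000_000))
```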

 

  Lesson 10 Optimization in Machine Learning

Most problems in machine learning can be reduced to parameter optimization, that is, finding the parameters best suited to the objective function, generally by maximizing or minimizing it.

A common optimization method is gradient descent, which repeatedly moves in the direction of steepest descent and approaches an extremum through iteration. The main concerns are the learning rate and the convergence rate. Roughly speaking, convergence is fastest when the objective is smooth and strictly convex, slower when it is smooth but not strictly convex, and slowest when it is non-smooth. When part of the objective is smooth and part is not, the proximal gradient method can be used; it has become popular in recent years, works better than plain gradient descent, and its update is similar to Nesterov's accelerated gradient method (the mathematical formulas went completely over my head). To find a local extremum, one can also use the Hessian matrix from a second-order Taylor expansion; a typical algorithm is L-BFGS. In addition, when the parameter to be optimized is a vector, we do not have to treat all its elements jointly: coordinate-wise (separate) optimization updates one component at a time while keeping the others fixed, looping until convergence. Finally, the teacher talked about convex function optimization, where the dual gradient descent method can also be used.
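
A bare-bones gradient descent sketch (my own) on a least-squares objective; the step size and iteration count are arbitrary:

```python
import numpy as np

# Plain gradient descent on the least-squares objective 0.5 * ||X w - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(5)
lr = 0.001                       # learning rate (step size)
for _ in range(2000):
    grad = X.T @ (X @ w - y)     # gradient of the objective
    w -= lr * grad

print(w)  # close to true_w
```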

To be honest, this pure mathematical formula is too boring!

 

  Lesson 11 Online Learning

Online learning means that every time a new data point arrives, an updated prediction function is learned; the criterion for optimality is that the loss at the current step is minimized, so the prediction function may differ at every step. This is online learning. Online learning actually appeared long ago, for example in the perceptron learning rule.

Before studying online learning, we need to understand the concept of regret. Regret is the average, over the steps taken so far, of the loss incurred at each online learning step minus the loss that the best fixed function in hindsight would have incurred; naturally, the smaller the regret, the better.

The key to online learning is continually updating the state. Online learning is itself an optimization problem, and the optimization methods from Lesson 10 all have corresponding online versions, e.g., convex optimization, gradient descent, and proximal descent. Converting proximal descent to an online version can involve L1 regularization, dual averaging, and second-order information. Stochastic gradient descent can be used to optimize over large-scale data; its different variants mainly come from different proximal functions, different learning rates, dual averaging, averaging, and acceleration.
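
A small sketch (my own) of online/stochastic gradient descent for logistic regression, updating the weights one example at a time with a decreasing step size; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)  # labels in {0, 1}

w = np.zeros(d)
for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
    lr = 1.0 / np.sqrt(t)                    # decreasing step size
    p = 1.0 / (1.0 + np.exp(-x_t @ w))       # predicted probability
    w -= lr * (p - y_t) * x_t                # gradient of the log loss on one example

acc = np.mean(((1.0 / (1.0 + np.exp(-X @ w))) > 0.5) == (y == 1))
print("accuracy after one online pass:", acc)
```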

 

  Lesson 12 Sparsity Models

Sparsity models arose for problems in which the number of samples is much smaller than the feature dimension, to address the curse of dimensionality in statistical learning. The standard sparse regression model can be solved by greedy algorithms or by convex relaxation; the representative greedy algorithm is OMP (orthogonal matching pursuit). Recovering the sparse parameters requires conditions such as the irrepresentable condition and RIP. A typical problem for sparse models is solving the lasso, and the instructor discussed the lasso solution starting from these two conditions. The lasso is based on L1 regularization. Sparsity models with other, more complex regularizers include structured sparsity (e.g., group structure), graphical models, and matrix regularization. This is a pure mathematics course.
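
A compact sketch (my own, not from the course) of ISTA, i.e. proximal gradient descent for the lasso, whose soft-thresholding step produces exact zeros; the problem sizes and lambda are made up:

```python
import numpy as np

# ISTA for the lasso objective  0.5 * ||X w - y||^2 + lam * ||w||_1.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, d, lam = 100, 200, 5.0          # fewer samples than features
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [4, -3, 2, 5, -2]     # only 5 nonzero coefficients
y = X @ w_true + rng.normal(scale=0.1, size=n)

step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
w = np.zeros(d)
for _ in range(500):
    w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)

print("largest coefficients at indices:", np.argsort(-np.abs(w))[:5])
```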

  Lesson 13 Graphical Models

Graphical models are widely used but complicated, since they involve a lot of probability. This lesson stays relatively superficial and skips most details, introducing graphical models from three aspects: the model itself, inference methods, and structure learning. Most probabilistic models are graphical models. Graphical models are divided into directed graphs and undirected graphs; the representative directed model is the Bayesian network and the representative undirected model is the MRF. This lesson focuses on directed graphs. Any complex Bayesian network can be built from three building blocks: causal chains, common cause, and common effect. Graphical models are widely applicable: for example, ordinary linear regression can be cast as a graphical model, and piecewise linear regression can be cast as a graphical model with hidden variables. In a Bayesian network, inference generally means computing the probability of some intermediate (unobserved) states given the observed data. When the network is a simple chain or a tree, inference is relatively easy; when the model contains loops, inference becomes very complicated. The last problem for graphical models is structure learning, which can be viewed as a structure search problem, so many AI search algorithms can be used here. Structure learning also involves discovering hidden variables in the model and learning causal relationships directly from the data.
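
A toy sketch (my own, with made-up probabilities) of exact inference by enumeration on a tiny causal chain A → B → C:

```python
# Tiny hypothetical Bayesian network: binary chain A -> B -> C.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2},   # P(B | A)
               False: {True: 0.1, False: 0.9}}
P_C_given_B = {True: {True: 0.9, False: 0.1},   # P(C | B)
               False: {True: 0.2, False: 0.8}}

# The joint factorizes along the chain: P(A, B, C) = P(A) P(B | A) P(C | B).
def joint(a, b, c):
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

# Query P(A = True | C = True) by summing out the hidden variable B.
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
print("P(A=True | C=True) =", num / den)
```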

 

  Lesson 14 Structured Learning

Structured learning covers structured input, structured output, and structured models. Structured models are divided into conditional models and generative models. The representative generative model is the HMM, which assumes that the observations are conditionally independent; to address the problems caused by this assumption, scholars later proposed the MEMM algorithm. However, MEMM itself introduces the label bias problem, which the improved CRFs solve. The CRF model can be seen as the extension of logistic regression to the structured-learning framework; similarly, M3N can be seen as the extension of SVM to the structured framework. Finally, the teacher compared the CRF and M3N algorithms.
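
A small Viterbi-decoding sketch for an HMM (my own illustration; all probabilities and the observation sequence are made up), showing how the most likely hidden state sequence is recovered:

```python
import numpy as np

start = np.array([0.6, 0.4])                  # P(first state)
trans = np.array([[0.7, 0.3],                 # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],             # P(observation | state)
                 [0.1, 0.3, 0.6]])

obs = [0, 1, 2, 2]                            # observed symbol indices

n_states, T = trans.shape[0], len(obs)
delta = np.zeros((T, n_states))               # best path probability ending in each state
back = np.zeros((T, n_states), dtype=int)     # backpointers

delta[0] = start * emit[:, obs[0]]
for t in range(1, T):
    for j in range(n_states):
        scores = delta[t - 1] * trans[:, j]
        back[t, j] = np.argmax(scores)
        delta[t, j] = scores.max() * emit[j, obs[t]]

# Trace back the most likely state sequence.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print("most likely states:", path[::-1])
```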

 

  Lesson 15 Deep Learning

The content of this lesson easily stirs up interest: first because deep learning has been very popular recently, and second because applying deep learning to some vision problems improves the results considerably. The lesson does not cover specific details; it mainly introduces some concepts and applications of deep learning. Deep learning means that features can be learned automatically. For example, visual classification or recognition is generally done as feature extraction + classifier design, and the quality of the extracted features directly affects the classifier's performance. Yet in today's computer vision field, feature extraction is designed by hand, and different application scenarios require different features. Teacher Yu joked that the greatest achievement of computer vision in the last 10 years was the SIFT feature, but it is based on RGB images; now that various new sensors such as the Kinect have appeared, must we redesign the features by hand and wait another 10 years? This shows the need for a general feature-extraction framework, which is what deep learning, also called feature learning, aims to provide: given many samples, the system automatically learns the features of those samples instead of relying on manual design. How attractive that sounds! It feels much more like AI. Deep learning is mainly about determining the hierarchy of an algorithm, and this hierarchy is very important; the idea is similar to the working mechanism of the human cerebral cortex, since the human brain also recognizes things through a hierarchical structure. The slides mainly mention multi-scale models, hierarchical models, structure spectrum, and so on, but only as an overview without details.

 

  Lesson 16 Transfer Learning & Semi-Supervised Learning

On one hand, for some problems the training sample is very small, the cost of acquiring samples is very high, or training the model takes a very long time; on the other hand, many problems are similar to each other. This is what motivates TL (transfer learning). TL puts multiple similar tasks together so that they share the same input space and output space. Common examples of TL include sensor-network prediction, recommendation systems, and image classification. Common TL models include HLM (hierarchical linear model), NN, and regression linear models; essentially, these models learn a shared hidden feature space. The teacher also compared TL with GP (Gaussian processes): a Gaussian process is a nonlinear Bayesian kernel machine that obtains a sharper posterior from a prior by learning on samples, and it is a non-parametric model. TL methods fall into four main categories: transferring samples, transferring feature representations, transferring models, and transferring knowledge from related domains. Feature-representation transfer and model transfer are mathematically similar in essence and are the main focus of researchers.

SSL (semi-supervised learning) learns a better model from a small number of labeled samples plus a large number of unlabeled samples than could be learned from the labeled samples alone. The instructor used a Gaussian mixture example to explain the effect of SSL, and that example leads to a general SSL model. The lesson also briefly introduced co-training: the features are split into several groups, a model is trained on each group, the models are then applied to the unlabeled samples, and an optimization step encourages their outputs to agree. The graph Laplacian and its harmonic solution went completely over my head.

 

  Lesson 17 Recommendation Systems

A simple application of a recommendation system is to infer, from a user's purchase history, which products the user is likely to enjoy and recommend them. Many Internet companies are doing research in this area, because it brings substantial economic benefits. Recommendation is a collaborative filtering problem. This lesson uses the example of different users rating different movies. The first issue to handle is that the historical data has different biases, so the data must be preprocessed and normalized.

One mainstream approach to recommendation systems treats recommendation as a classification problem: to predict user i's ratings for all movies, user i's rating is the label and all other users' ratings are the features, mainly using naive Bayes and KNN (most other classification algorithms can also be used). The other mainstream approach treats it as a matrix factorization (MF) problem, which works best in practice. Because the observed data is sparse, many entries are missing, and the data has a simple underlying structure, the matrix R to be completed can be factored into the product of two low-rank matrices, which can be solved by SVD or by SVD plus some optimization methods.
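
A minimal matrix-factorization sketch (my own; it ignores missing entries, which real recommenders must handle) using a truncated SVD of a synthetic ratings matrix:

```python
import numpy as np

# Approximate a ratings matrix R with the product of two rank-k factors.
rng = np.random.default_rng(0)
n_users, n_movies, k = 20, 15, 3
U_true = rng.normal(size=(n_users, k))
V_true = rng.normal(size=(n_movies, k))
R = U_true @ V_true.T + 0.1 * rng.normal(size=(n_users, n_movies))

U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]        # n_users x k
movie_factors = Vt[:k].T               # n_movies x k

R_hat = user_factors @ movie_factors.T
print("relative reconstruction error:", np.linalg.norm(R - R_hat) / np.linalg.norm(R))
```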

From this we can see that recommendation systems is a typical ml problem.

 

  Lesson 18 Computer Vision

This lesson briefly introduces the basics of computer vision: what computer vision is and why it is difficult, and the main categories of computer vision problems: feature detection, edge detection, object detection, image segmentation, puzzles, 3D reconstruction, computer graphics, and object recognition.

 

  Lesson 19 Learning on the Web

Machine learning is widely used on the Web, for example in the recommendation systems mentioned earlier, as well as in ranking search results, classification problems, community behavior analysis, and user behavior models. This lesson mainly covers classification and ranking. The Web is full of spam of various kinds, such as spam e-mail, spam web pages, and spam ads; one classification problem is to use ML methods to filter this spam out. Another common classification problem is text classification, which finds the topic a text describes; the BOW (bag-of-words) approach is simple and achieves good results. Finally, the instructor gave a brief introduction to the web-search (ranking) problem. In short, this lesson surveys simple applications and challenges of ML on the Web.
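
A tiny bag-of-words + naive Bayes sketch with scikit-learn (my own illustration; the "spam" examples are hand-made, not a real corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "cheap pills limited offer",
    "meeting notes for the project", "lunch tomorrow with the team",
    "free offer win money", "project deadline next week",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

# CountVectorizer builds the bag-of-words features; MultinomialNB classifies.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["free prize offer", "notes from the team meeting"]))
```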

 

  Summary:

My overall impression is that regularization terms and optimization run through every chapter of this course.

 

  References:

http://bigeye.au.tsinghua.edu.cn/DragonStar2012/index.html

 

 

 
