Three elements of machine learning

Part of the series Machine Learning Algorithms: Principles, Implementation and Practice. This part covers the three elements of machine learning.

 

1 Model

In supervised learning, the model is the conditional probability distribution or decision function to be learned. The hypothesis space of the model contains all possible conditional probability distributions or decision functions. For example, if the decision function is a linear function of the input variable, the hypothesis space of the model is the set of all such linear functions.

The hypothesis space is denoted by $\mathcal{F}$. It can be defined as a set of decision functions:
$\mathcal{F} = \{ f \mid Y = f(X) \}$

$X$ and $Y$ are variables defined on the input space $\mathcal{X}$ and the output space $\mathcal{Y}$. $\mathcal{F}$ is usually a family of functions determined by a parameter vector:
$\mathcal{F} = \{ f \mid Y = f_{\theta}(X),\ \theta \in \mathbf{R}^n \}$

The parameter vector $\theta$ takes values in the $n$-dimensional Euclidean space $\mathbf{R}^n$, which is called the parameter space.
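
To make the notion of a parametrized hypothesis space concrete, here is a minimal Python sketch (the names and the linear form are illustrative assumptions, not from the source): each parameter vector $\theta \in \mathbf{R}^n$ picks out one decision function from the family $\mathcal{F}$.

```python
import numpy as np

def make_linear_f(theta):
    """Return one member f of the family F = {f | y = f_theta(x)}.

    theta is a point in the parameter space R^n; each choice of theta
    selects a different concrete decision function."""
    def f(x):
        # A linear decision function: f_theta(x) = theta . x
        return float(np.dot(theta, x))
    return f

# Two parameter vectors give two different members of the hypothesis space.
f1 = make_linear_f(np.array([1.0, -2.0]))
f2 = make_linear_f(np.array([0.5, 0.5]))
print(f1(np.array([3.0, 1.0])))  # 1.0
print(f2(np.array([3.0, 1.0])))  # 2.0
```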

The hypothesis space can also be defined as a set of conditional probability distributions:
$\mathcal{F} = \{ P \mid P(Y \mid X) \}$

$X$ and $Y$ are random variables defined on the input space $\mathcal{X}$ and the output space $\mathcal{Y}$. $\mathcal{F}$ is usually a family of conditional probability distributions determined by a parameter vector:

$\mathcal{F} = \{ P \mid P_{\theta}(Y \mid X),\ \theta \in \mathbf{R}^n \}$

A model represented by a decision function is a non-probabilistic model; a model represented by a conditional probability distribution is a probabilistic model.
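
The following sketch (a hypothetical example for illustration) contrasts the two kinds of model: a non-probabilistic model returns a prediction directly, while a probabilistic model returns a conditional probability $P_{\theta}(Y \mid X)$; a logistic form is assumed here for the probabilistic model.

```python
import numpy as np

def decision_model(theta, x):
    """Non-probabilistic model: outputs a label y = f_theta(x) directly."""
    return 1 if np.dot(theta, x) >= 0 else 0

def probability_model(theta, x):
    """Probabilistic model: outputs P_theta(Y = 1 | X = x), here logistic."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([2.0, -1.0])
x = np.array([0.5, 0.2])
print(decision_model(theta, x))     # a label: 1
print(probability_model(theta, x))  # a probability: ~0.69
```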

2 Strategy

Given the hypothesis space of the model, machine learning must next consider the criterion for learning, that is, for selecting the optimal model.
We first introduce the loss function and the risk function: the loss function measures how well the model predicts on a single instance, while the risk function measures how well the model predicts on average.

2.1 Loss functions and risk functions

For a given input $X$, a decision function $f$ selected from the hypothesis space $\mathcal{F}$ produces an output $f(X)$. This predicted value $f(X)$ may or may not agree with the true value $Y$. A loss function or cost function is used to measure the degree of prediction error. The loss function is a non-negative real-valued function of $f(X)$ and $Y$, denoted $L(Y, f(X))$.

Several common loss functions (implemented in the sketch after this list):

1) 0-1 loss function
$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$

2) Quadratic loss function
$L(Y, f(X)) = (Y - f(X))^2$

3) Absolute loss function
$L(Y, f(X)) = |Y - f(X)|$

4) Logarithmic loss function (log-likelihood loss function)
$L(Y, P(Y \mid X)) = -\log P(Y \mid X)$
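
As a minimal sketch (the function names are my own), the four loss functions can be written directly:

```python
import numpy as np

def zero_one_loss(y, fx):
    """0-1 loss: 1 if the prediction is wrong, 0 if it is correct."""
    return 0.0 if y == fx else 1.0

def quadratic_loss(y, fx):
    """Quadratic loss: (y - f(x))^2."""
    return (y - fx) ** 2

def absolute_loss(y, fx):
    """Absolute loss: |y - f(x)|."""
    return abs(y - fx)

def log_loss(p_y_given_x):
    """Logarithmic loss: -log P(Y | X), where p_y_given_x is the
    probability the model assigns to the observed label."""
    return -np.log(p_y_given_x)

print(zero_one_loss(1, 0))       # 1.0
print(quadratic_loss(3.0, 2.5))  # 0.25
print(absolute_loss(3.0, 2.5))   # 0.5
print(log_loss(0.8))             # ~0.223
```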

The smaller the value of the loss function, the better the model. Since the input and output $(X, Y)$ of the model are random variables following the joint distribution $P(X, Y)$, the expectation of the loss is
$R_{exp}(f) = E_P[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x)) \, P(x, y) \, dx \, dy$

This is the average loss of the model $f(X)$ with respect to the joint distribution $P(X, Y)$, called the risk function or expected loss.

The goal of learning is to select the model with the smallest expected risk. However, the joint distribution $P(X, Y)$ is the statistical regularity that the samples obey, and it is unknown, so $R_{exp}(f)$ cannot be computed directly. In fact, if the joint distribution were known, the conditional probability $P(Y \mid X) = P(X, Y) / P(X)$ could be computed directly, and no learning would be needed at all.
Thus, on the one hand, learning by minimizing the expected risk requires the joint distribution; on the other hand, the joint distribution is unknown. In this sense, supervised learning is an ill-posed problem.
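
Although the joint distribution is unknown in any real problem, a synthetic example where $P(X, Y)$ is known makes the definition of $R_{exp}(f)$ tangible. The sketch below (the distribution, the model, and all names are assumptions made up for illustration) approximates the expected risk under quadratic loss by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n):
    """Draw n samples from a toy joint distribution P(X, Y):
    X ~ N(0, 1) and Y = 2X + Gaussian noise with std 0.5."""
    x = rng.normal(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.5, n)
    return x, y

def f(x):
    """A fixed candidate model from the hypothesis space."""
    return 1.8 * x

# Monte Carlo estimate of R_exp(f) = E[(Y - f(X))^2].
x, y = sample_joint(1_000_000)
print(np.mean((y - f(x)) ** 2))  # ~0.29: noise variance 0.25 + bias term 0.04
```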

For a given training dataset
$T = \{ (x_1, y_1), (x_2, y_2), \dots, (x_N, y_N) \}$

the average loss of the model $f(X)$ over the training dataset is called the empirical risk or empirical loss, denoted $R_{emp}(f)$:
$R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
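
In code, the empirical risk is simply the mean loss over the training set. A minimal sketch (the toy data and names are illustrative):

```python
import numpy as np

def empirical_risk(f, xs, ys, loss):
    """R_emp(f) = (1/N) * sum_i L(y_i, f(x_i))."""
    return float(np.mean([loss(y, f(x)) for x, y in zip(xs, ys)]))

# A toy training set T = {(x_i, y_i)} and a candidate model.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 2.1, 3.9, 6.2])
f = lambda x: 2.0 * x
quadratic = lambda y, fx: (y - fx) ** 2

print(empirical_risk(f, xs, ys, quadratic))  # mean squared error on T
```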

The expected risk $R_{exp}(f)$ is the expected loss of the model with respect to the joint distribution, while the empirical risk $R_{emp}(f)$ is the average loss of the model over the training set. By the law of large numbers, as the sample size $N$ tends to infinity, the empirical risk $R_{emp}(f)$ converges to the expected risk $R_{exp}(f)$.

Therefore, a natural idea is to estimate the expected risk by the empirical risk. However, because the number of training samples is limited in practice, this estimate is often poor, and the empirical risk must be corrected. This leads to the two basic strategies of supervised learning: empirical risk minimization and structural risk minimization.

2.2 Empirical risk minimization and structural risk minimization

Once the hypothesis space, the loss function, and the training dataset are determined, the empirical risk function is determined. The strategy of empirical risk minimization (ERM) holds that the model with the smallest empirical risk is the optimal model. Under this strategy, finding the optimal model means solving the optimization problem:
$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$

where $\mathcal{F}$ is the hypothesis space.

When the sample size is large enough, empirical risk minimization guarantees good learning performance and is widely used in practice. Maximum likelihood estimation is one example: when the model is a conditional probability distribution and the loss function is the logarithmic loss, empirical risk minimization is equivalent to maximum likelihood estimation.
However, when the sample size is small, empirical risk minimization may not work well, and overfitting tends to occur.
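
To see the ERM/maximum-likelihood equivalence numerically, here is a sketch with a Bernoulli model (an assumed example, not from the source): with log loss, the empirical risk is the negative average log-likelihood, so minimizing one maximizes the other.

```python
import numpy as np

# Observed binary labels; model family: P_theta(Y = 1) = theta.
ys = np.array([1, 1, 0, 1, 0, 1, 1, 0])

def empirical_log_loss(theta):
    """R_emp(theta) = -(1/N) * sum_i log P_theta(y_i): the empirical
    risk under log loss, i.e. the negative average log-likelihood."""
    p = np.where(ys == 1, theta, 1.0 - theta)
    return -np.mean(np.log(p))

# The grid minimizer of the empirical log loss...
grid = np.linspace(0.01, 0.99, 99)
theta_erm = grid[np.argmin([empirical_log_loss(t) for t in grid])]

# ...agrees with the maximum likelihood estimate (the sample mean).
print(theta_erm, ys.mean())  # ~0.62 (grid resolution) vs 0.625
```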

Structural risk minimization (SRM) is a strategy proposed to prevent overfitting. Structural risk minimization is equivalent to regularization: a regularization term (or penalty term) representing the complexity of the model is added to the empirical risk. When the hypothesis space, the loss function, and the training dataset are determined, the structural risk is defined as
$R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$

Here $J(f)$ is the complexity of the model, a functional defined on the hypothesis space $\mathcal{F}$. The more complex the model $f$, the larger $J(f)$; the simpler the model, the smaller $J(f)$. In other words, the complexity term is a penalty on complex models. $\lambda \ge 0$ is a coefficient that trades off the empirical risk against the model complexity. A small structural risk requires both a small empirical risk and a small model complexity. Models with small structural risk tend to predict well on both the training data and unknown test data.
For example, maximum a posteriori (MAP) estimation in Bayesian estimation is an instance of structural risk minimization: when the model is a conditional probability distribution, the loss function is the logarithmic loss, and the model complexity is expressed by the prior probability of the model, structural risk minimization is equivalent to MAP estimation.

The strategy of structural risk minimization holds that the model with the smallest structural risk is the optimal model. Finding the optimal model then means solving the optimization problem:

$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$
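
As a concrete instance (assumed for illustration: quadratic loss with complexity functional $J(f) = \lVert \theta \rVert^2$, a ridge-style penalty), the structural risk can be written and minimized in closed form:

```python
import numpy as np

def structural_risk(theta, xs, ys, lam):
    """R_srm(f) = empirical risk (quadratic loss) + lambda * J(f),
    with J(f) = ||theta||^2 penalizing model complexity."""
    emp = np.mean((ys - xs @ theta) ** 2)
    return emp + lam * float(np.dot(theta, theta))

def minimize_structural_risk(xs, ys, lam):
    """For this choice of loss and J, the minimizer has the ridge
    closed form: theta* = (X^T X / N + lam * I)^{-1} X^T y / N."""
    n, d = xs.shape
    return np.linalg.solve(xs.T @ xs / n + lam * np.eye(d), xs.T @ ys / n)

rng = np.random.default_rng(1)
xs = rng.normal(size=(50, 3))
ys = xs @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 50)
print(minimize_structural_risk(xs, ys, lam=0.0))  # plain ERM solution
print(minimize_structural_risk(xs, ys, lam=0.1))  # shrunk toward zero
```

Larger $\lambda$ shrinks the parameters harder, trading a slightly larger empirical risk for lower model complexity.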

In this way, supervised learning becomes the problem of optimizing the empirical or structural risk function; the empirical or structural risk function is the objective function of the optimization.

3 Algorithm

From the above we can see that once the strategy for selecting the optimal model is determined, the machine learning problem reduces to an optimization problem: the algorithm question in machine learning becomes the question of which algorithm to use to solve for the optimal model. Moreover, the optimal model usually has no analytical solution and must be found by numerical computation, so the algorithm needs to find the global optimum and do so efficiently.
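
For instance, with quadratic loss and a linear model the empirical risk is differentiable, and gradient descent is one standard numerical method. A minimal sketch (illustrative only, not presented as the source's algorithm):

```python
import numpy as np

def gradient_descent(xs, ys, lr=0.1, steps=500):
    """Minimize R_emp(theta) = (1/N) * sum_i (y_i - theta . x_i)^2
    by repeatedly stepping against the gradient."""
    theta = np.zeros(xs.shape[1])
    n = len(ys)
    for _ in range(steps):
        grad = -2.0 / n * xs.T @ (ys - xs @ theta)  # dR_emp / dtheta
        theta -= lr * grad
    return theta

rng = np.random.default_rng(2)
xs = rng.normal(size=(100, 2))
ys = xs @ np.array([3.0, -1.0]) + rng.normal(0.0, 0.1, 100)
print(gradient_descent(xs, ys))  # approaches [3.0, -1.0]
```

This convex case has a single global minimum; for non-convex empirical risks, guaranteeing the global optimum is the hard part noted above.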
