Understanding Machine Learning: From Theory to Algorithms - Learning Notes, Week 1, Chapter 2 (A Gentle Start)
Contents: 1 The formal model (the statistical learning framework): the learner's input, the learner's output, a simple data-generation model, measures of success, the information available to the learner; 2 Empirical risk minimization; 3 Empirical risk minimization with inductive bias
My homepage www.csxiaoyao.com
Chapter 2 analyzes, with proofs, the factors that must be considered in a learning problem. Take papayas as an example: to learn whether a papaya is tasty, one observes its color and its softness and then decides whether to eat it.
The first step is to describe a formal model that captures such learning tasks.
2.1 The formal model: the statistical learning framework
1 The learner's input
Domain set: X, for example, the set of all papayas.
Label set: Y. For now we only consider binary label sets, such as {0, 1} or {−1, +1}, indicating whether a papaya is tasty or not.
Training data: a finite sequence S = ((x1, y1), ..., (xm, ym)) of pairs in X × Y; S is called a training set.
2 The learner's output
The learner outputs a prediction rule h : X → Y, also called a predictor, a hypothesis, or a classifier, for example a rule that predicts whether a papaya at the farmers' market is tasty. A(S) denotes the hypothesis that learning algorithm A returns when given the training sequence S.
3 A simple data-generation model
How is the training data generated? First, assume that instances (papayas) are sampled according to some probability distribution D (the environment, e.g. the island the papayas grow on). The learner knows nothing about this distribution. Next, assume there exists a correct labeling function f : X → Y, also unknown to the learner, such that yi = f(xi) for every i; the learner's task is to predict the correct label of a sample (whether the papaya is tasty). In summary, the training set S is generated by sampling points xi according to D and then labeling them with the correct labeling function f. (h denotes the learner's prediction rule; f is the true labeling function.)
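To make this concrete, here is a minimal Python sketch of the data-generation process (my own illustration, not from the book): a toy distribution D over papayas described by two made-up features (color, softness), and an assumed true labeling function f; the training set S is produced by sampling from D and labeling with f.

```python
import random

def sample_instance():
    """Draw one instance x from a toy distribution D (unknown to the learner):
    here x = (color, softness), each uniform in [0, 1]. This D is invented
    purely for illustration."""
    return (random.random(), random.random())

def f(x):
    """Assumed 'true' labeling function f: X -> {0, 1}: a papaya is tasty (1)
    when its color and softness fall into a middle range."""
    color, softness = x
    return int(0.3 <= color <= 0.8 and 0.2 <= softness <= 0.7)

def generate_training_set(m):
    """S = ((x1, y1), ..., (xm, ym)): sample xi ~ D, then set yi = f(xi)."""
    return [(x, f(x)) for x in (sample_instance() for _ in range(m))]

S = generate_training_set(10)
print(S[:3])  # a few labeled papayas, e.g. ((0.42, 0.26), 1)
```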
4 Measures of success
The error of a classifier (predictor) h is the probability that h(x) ≠ f(x), where x is a random sample drawn according to the distribution D.
Formally, given a domain subset A ⊂ X, the probability distribution D assigns a probability D(A) of observing x ∈ A. In many cases A is specified by a predicate π : X → {0, 1}, that is, A = {x ∈ X : π(x) = 1}; in that case D(A) can also be written as P_{x∼D}[π(x)].
The error of a prediction rule h : X → Y is defined as:
L_{D,f}(h) = P_{x∼D}[h(x) ≠ f(x)] = D({x : h(x) ≠ f(x)})
where x is a random sample drawn from X according to D. L_{D,f}(h) is also called the generalization error, the loss, or the true error of h (L stands for loss).
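Since in our toy simulation we (unlike the learner) do know D and f, we can estimate the true error by sampling from D directly; this is just estimating D(A) for the disagreement set A = {x : h(x) ≠ f(x)}. A minimal sketch, re-declaring the toy f assumed above so it runs on its own:

```python
import random

def f(x):
    """Toy 'true' labeling function (same assumption as the earlier sketch)."""
    color, softness = x
    return int(0.3 <= color <= 0.8 and 0.2 <= softness <= 0.7)

def h(x):
    """Some candidate predictor: declares every papaya with color > 0.5 tasty."""
    color, _ = x
    return int(color > 0.5)

def estimate_true_error(h, n=200_000):
    """Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]
    under the toy uniform distribution D."""
    xs = ((random.random(), random.random()) for _ in range(n))
    return sum(1 for x in xs if h(x) != f(x)) / n

print(estimate_true_error(h))  # approximate true error of h under the toy D
```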
5 A note on the information available to the learner
The distribution D and the labeling function f are unknown to the learner; all the learner can access is the training set S.
2.2 Empirical risk minimization
Because the learner does not know D and f, the true error cannot be computed directly; only the training error (empirical error) can be calculated:
L_S(h) = |{i ∈ [m] : h(xi) ≠ yi}| / m
where [m] = {1, ..., m}. Choosing a predictor h that minimizes L_S(h) is called empirical risk minimization, abbreviated ERM. ERM can overfit: a small L_S(h) does not imply a small L_{D,f}(h).
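The overfitting point can be seen with a 'memorizing' predictor, in the spirit of the book's example; the concrete features and distribution below are my own toy assumptions. It attains zero training error yet has a large true error, because it predicts 0 on every papaya it has not seen.

```python
import random

def f(x):
    """Toy 'true' labeling function (same assumption as the earlier sketches)."""
    color, softness = x
    return int(0.3 <= color <= 0.8 and 0.2 <= softness <= 0.7)

def empirical_risk(h, S):
    """Training error L_S(h) = |{i in [m] : h(x_i) != y_i}| / m."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

random.seed(0)
xs = [(random.random(), random.random()) for _ in range(50)]
S = [(x, f(x)) for x in xs]

# Memorizing predictor: repeats labels seen in S, predicts 0 on anything else.
memory = dict(S)
def h_memorize(x):
    return memory.get(x, 0)

print(empirical_risk(h_memorize, S))  # 0.0 -- zero training error
# On fresh samples from D it errs whenever f(x) = 1 (about 25% of the mass here):
fresh = [(random.random(), random.random()) for _ in range(100_000)]
print(sum(1 for x in fresh if h_memorize(x) != f(x)) / len(fresh))  # far from 0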
2.3 Empirical risk minimization with inductive bias
A common fix for ERM's overfitting is to apply ERM over a restricted search space: the learner chooses a set of predictors, the hypothesis class H, before seeing the data, and the ERM_H learner then uses the ERM rule to pick an h ∈ H that minimizes the training error on S.
Because this choice is made before the learner sees the data, it must be based on prior knowledge about the learning problem. Restricting the hypothesis class helps prevent overfitting, but it also introduces a stronger inductive bias.
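A small sketch of ERM_H (my own construction, not the book's): the hypothesis class H below is a finite set of interval rules over the color feature, fixed before seeing any data; ERM_H scans H and returns a hypothesis with the smallest training error on S.

```python
import random

def f(x):
    """Toy 'true' labeling function (same assumption as the earlier sketches)."""
    color, softness = x
    return int(0.3 <= color <= 0.8 and 0.2 <= softness <= 0.7)

def empirical_risk(h, S):
    """Training error L_S(h)."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def make_interval_rule(lo, hi):
    """Hypothesis: tasty iff the color feature lies in [lo, hi]."""
    return lambda x: int(lo <= x[0] <= hi)

# Finite hypothesis class H, fixed *before* seeing any data. This is the
# inductive bias: we commit in advance to interval rules over color only.
grid = [i / 10 for i in range(11)]
H = [make_interval_rule(lo, hi) for lo in grid for hi in grid if lo <= hi]

def erm_H(S):
    """ERM_H: return some h in H minimizing the empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

random.seed(1)
S = [(x, f(x)) for x in ((random.random(), random.random()) for _ in range(100))]
h_S = erm_H(S)
print(empirical_risk(h_S, S))  # small, but not necessarily 0: H is restricted
```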
One of the simplest restrictions on a class is to bound its cardinality (the number of hypotheses h in H). In machine learning we assume that the training samples in S are drawn independently from the same distribution D (i.i.d.), but it is still possible to draw a training set that is completely unrepresentative of D. We denote the probability of sampling such a non-representative set by δ, and call (1 − δ) the confidence parameter.
Since we cannot guarantee perfectly accurate label prediction, we introduce a parameter that measures prediction quality, the accuracy parameter, denoted ε. If L_{D,f}(h_S) ≤ ε, we consider the prediction approximately correct.
Misleading sets: a training set S is called misleading if some 'bad' hypothesis h ∈ H with true error L_{D,f}(h) > ε nevertheless attains zero training error, L_S(h) = 0; the analysis bounds the probability of drawing such a set.
Summary: for a sufficiently large sample size m, the ERM_H rule over a finite hypothesis class is probably (with confidence 1 − δ) approximately (with error at most ε) correct.
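For reference, a sketch of the quantitative statement behind this summary as I understand it from the book, assuming realizability (some h* ∈ H with L_{D,f}(h*) = 0):

```latex
% Union bound over the bad hypotheses H_B = { h in H : L_{D,f}(h) > epsilon }:
% the probability of drawing a misleading training set is at most
\mathcal{D}^m\bigl(\{ S|_x : L_{\mathcal{D},f}(h_S) > \epsilon \}\bigr)
  \;\le\; |\mathcal{H}_B|\,(1-\epsilon)^m
  \;\le\; |\mathcal{H}|\,e^{-\epsilon m}.

% Requiring |H| e^{-epsilon m} <= delta gives the sample-size condition
m \;\ge\; \frac{\log\bigl(|\mathcal{H}|/\delta\bigr)}{\epsilon},
% under which, with probability at least 1 - delta over the choice of S,
% every ERM_H hypothesis h_S satisfies L_{D,f}(h_S) <= epsilon.
```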