Statistical Learning Methods: Study Note One

Chapter 1. Introduction to Statistical Learning Methods

The main characteristics of statistical learning are:
         (1) Platform: it is built on computers and networks;
         (2) Research object: data; it is a data-driven discipline;
         (3) Objective: to predict and analyze data;
         (4) Center: methods; a statistical learning method constructs a model and applies that model to prediction and analysis;
         (5) Interdisciplinarity: it draws on probability theory, statistics, information theory, computation theory, optimization theory, computer science, and many other fields.
The object of statistical learning
  The object of research is data.
  [Flowchart: data in, then feature extraction, then knowledge discovery, then analysis and prediction of the data.]

Classification of statistical learning methods
  Supervised learning
  Unsupervised learning
  Semi-supervised learning
  Reinforcement learning
Three elements of a statistical learning method
Statistical learning method = model + strategy + algorithm
 model: find a conditional probability distribution or decision function that can solve the problem.
 strategy: find a loss function (such as the 0-1 loss) with which to measure and optimize the model.
 algorithm: find a method to optimize the loss function (for example, gradient descent).
Steps of a statistical learning method
 1. Obtain a finite training data set.
 2. Determine the hypothesis space, i.e., the set of all candidate models.
 3. Determine the criterion for model selection, i.e., the strategy.
 4. Implement the algorithm for solving the optimal model, i.e., the algorithm.
 5. Select the optimal model by learning.
 6. Use the optimal model to predict and analyze new data.
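To make these steps concrete, here is a minimal sketch in Python, assuming (as an illustrative choice, not something the text prescribes) a one-dimensional linear hypothesis space, squared loss as the strategy, and gradient descent as the algorithm:

```python
import numpy as np

# Step 1: a finite training data set (toy data generated around y = 2x + 1)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, 50)

# Step 2: hypothesis space -- all linear functions f(x) = w*x + b
# Step 3: strategy -- empirical risk under the squared loss
def empirical_risk(w, b):
    return np.mean((y - (w * x + b)) ** 2)

# Steps 4/5: algorithm -- gradient descent on the empirical risk
w, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    residual = y - (w * x + b)
    w -= lr * (-2.0 * np.mean(residual * x))
    b -= lr * (-2.0 * np.mean(residual))

print(f"learned model: f(x) = {w:.2f}*x + {b:.2f}  (risk {empirical_risk(w, b):.4f})")

# Step 6: use the optimal model to predict new data
print("prediction at x = 0.5:", w * 0.5 + b)
```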
Research in statistical learning
Research in statistical learning covers three aspects: research on statistical learning methods, i.e., discovering new learning methods; research on statistical learning theory, i.e., improving the effectiveness and efficiency of statistical learning methods; and research on statistical learning applications, i.e., applying statistical learning methods to practical problems and solving them.

Supervised learning
Supervised learning is the main subject of this book.

Supervised learning can also be called guided learning (you learn better under the guidance and supervision of a teacher), so in general a supervised model outperforms an unsupervised one. Of course, the training set comes at a cost: supervised learning requires more resources than unsupervised learning (after all, guidance must be provided).

Suppose the feature vector of an input instance x is written as

  x = (x^(1), x^(2), ..., x^(n))^T

Training set:

  T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}

Let the input variable be denoted by X and the output variable by Y. The input and output random variables X and Y satisfy a joint probability distribution P(X, Y). The model of the supervised learning problem is as follows:

[Diagram: training set -> learning system -> model; new input -> prediction system (using the learned model) -> predicted output.]

This model is easy to understand: feed the training set into the learning system, learn an optimal model according to the chosen strategy, and then use this optimal model to predict new data.

According to the types of the input and output variables, prediction tasks can be divided into the following three categories:

regression problems: prediction problems in which both the input and output variables are continuous;
classification problems: prediction problems in which the output variable takes finitely many discrete values;
tagging problems: prediction problems in which both the input and output variables are sequences of variables.

Their problem models only require changing the "prediction system" in the diagram above to a "classification system" or a "tagging system".
Three elements

Model

In supervised learning, the model is the conditional probability distribution or decision function to be learned.
  Decision function model: Y = f(X)
  Conditional probability model: P(Y|X)

Strategy

Loss function and risk function
Loss functions (or cost functions) are used to measure the predictive power of a model. The loss function is a non-negative real-valued function of the predicted value f(X) and the true value Y (the gap between the two can be understood as a distance, which is non-negative), written L(Y, f(X)).

Common loss functions:

  (1) 0-1 loss function:
      L(Y, f(X)) = 1 if Y ≠ f(X), and 0 if Y = f(X)

  (2) quadratic (squared) loss function:
      L(Y, f(X)) = (Y − f(X))^2

  (3) absolute loss function:
      L(Y, f(X)) = |Y − f(X)|

  (4) logarithmic loss function (log-likelihood loss function):
      L(Y, P(Y|X)) = −log P(Y|X)


There are of course other loss functions, such as the exponential loss and the hinge loss. The smaller the value of the loss function, the smaller the model's error and the better the model.
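As a small sketch, the four losses above can be written out directly (here y is the true value, f_x the predicted value, and p the probability the model assigns to the true label; the function names are my own):

```python
import numpy as np

def zero_one_loss(y, f_x):
    # (1) 1 if the prediction is wrong, 0 if it is right
    return float(y != f_x)

def quadratic_loss(y, f_x):
    # (2) squared difference between truth and prediction
    return (y - f_x) ** 2

def absolute_loss(y, f_x):
    # (3) absolute difference
    return abs(y - f_x)

def log_loss(p):
    # (4) negative log of the probability the model gives the true label
    return -np.log(p)
```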
Empirical loss and empirical risk
Because the model's input and output (X, Y) are random variables following the joint distribution P(X, Y), the expectation of the loss function is:

  R_exp(f) = E_P[L(Y, f(X))] = ∫ L(y, f(x)) P(x, y) dx dy

This is, in theory, the average loss of the model f(X) with respect to the joint distribution P(X, Y), called the risk function or expected loss. The goal of learning is to choose the model with the smallest expected risk. But on the one hand, minimizing the expected risk requires the joint probability distribution; on the other hand, the joint distribution is unknown (if it were known, there would be nothing left to learn). So supervised learning becomes an ill-posed problem.
Here another concept is introduced: empirical risk. (In my own understanding, anything prefixed with "empirical" usually means an average over observed data; after all, experience has to be accumulated.)
The average loss of the model f(X) over the training data set is called the empirical risk or empirical loss:

  R_emp(f) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i))
The expected risk R_exp(f) is the model's expected loss with respect to the joint distribution; the empirical risk R_emp(f) is the model's average loss over the training sample set. By the law of large numbers, as the sample size N tends to infinity, the empirical risk tends to the expected risk. So a natural idea is to estimate the expected risk by the empirical risk. However, because the number of training samples in practice is limited, even small, estimating the expected risk by the empirical risk alone often works poorly, and the empirical risk has to be corrected. This leads to the two basic strategies of supervised learning: empirical risk minimization and structural risk minimization.
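A quick simulation of the law-of-large-numbers point, with an assumed setup in which the expected risk is known exactly: the data follow y = x + Gaussian noise with standard deviation 0.5, and we evaluate the fixed model f(x) = x under squared loss, whose expected risk is the noise variance 0.25:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5  # expected risk of f(x) = x under squared loss is sigma**2 = 0.25

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(size=n)
    y = x + rng.normal(0.0, sigma, size=n)
    r_emp = np.mean((y - x) ** 2)  # empirical risk over the n samples
    print(f"N = {n:>9}: empirical risk = {r_emp:.4f}  (expected risk = {sigma**2})")
```

As N grows, the printed empirical risk settles toward 0.25, which is exactly the "empirical risk tends to the expected risk" claim above.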

Empirical risk minimization (ERM) solves the optimization problem:

  min_{f ∈ F} (1/N) Σ_{i=1}^{N} L(y_i, f(x_i))
When the sample size is large enough, empirical risk minimization can guarantee a good learning result (just as a person with more accumulated experience judges better). But when the sample size is small, empirical risk minimization does not learn very well (after all, if you have walked only a few roads in a big world, it is easy to judge wrongly), and it may produce the phenomenon of over-fitting. Hence structural risk minimization is needed.

Structural risk minimization (SRM) is designed to prevent over-fitting. It adds to the empirical risk a regularization term (regularizer) or penalty term that represents the complexity of the model. It is defined as:

  R_srm(f) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i)) + λ J(f)

where λ ≥ 0 trades off the empirical risk against the model complexity, and J(f) is the complexity of the model (it can sometimes be understood as the number of parameters the model needs).

A small structural risk requires both the empirical risk and the model complexity to be small.
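A sketch contrasting the two strategies on a degree-9 polynomial fit, taking J(f) to be the squared norm of the coefficient vector (a common but purely illustrative choice of complexity measure); the penalized least-squares problem then has the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)

def fit_poly(x, y, degree, lam):
    # lam = 0 is empirical risk minimization; lam > 0 is structural risk minimization
    X = np.vander(x, degree + 1)  # columns are powers of x
    w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return w

for lam in [0.0, 1e-3]:
    w = fit_poly(x, y, degree=9, lam=lam)
    name = "ERM" if lam == 0.0 else f"SRM (lambda={lam})"
    print(f"{name}: model complexity J(f) = ||w||^2 = {w @ w:.1f}")
```

Under ERM the interpolating fit typically has an enormous coefficient norm; the small penalty shrinks it drastically, which is exactly the "keep both terms small" trade-off.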

Algorithm

The algorithm is the concrete computational method for learning the model. The statistical learning problem reduces to an optimization problem, and the algorithm of statistical learning becomes the algorithm for solving this optimization problem: how to find the global optimum and make the solving process efficient.

Training error and test error

In general, we divide the data set into two parts: a training set and a test set (sometimes into three parts: training set, validation set, and test set).
The training error is the model's error on the training set, and it reflects the model's learning ability (it is the average loss over the training data set).
The test error is the model's error on the test set, and it reflects the model's predictive ability (it is the average loss over the test data set).
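A minimal sketch of the two errors, using made-up data, a simple 20/10 split, and a cubic polynomial fitted by least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 30)

# Split the data set into a training set and a test set
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

w = np.polyfit(x_train, y_train, deg=3)  # learn on the training set only
train_error = np.mean((np.polyval(w, x_train) - y_train) ** 2)  # learning ability
test_error = np.mean((np.polyval(w, x_test) - y_test) ** 2)     # predictive ability
print(f"training error = {train_error:.3f}, test error = {test_error:.3f}")
```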

Overfitting

Over-fitting: if one blindly pursues improving predictive power on the training data, the complexity of the selected model tends to become higher than that of the true model. This phenomenon is called over-fitting. An over-fitted model predicts the known data (the training data set) very well but predicts unknown data (the test data set) poorly.

For example: fit polynomial models of degree M to a set of sample points drawn from a curve. At M = 0 and M = 1 the model's learning and prediction abilities are both poor. At M = 9 the learning ability is very strong (the fitted polynomial can pass through every training sample point), but the predictive ability is poor and the model is far too complex. At M = 3 both the learning ability and the predictive ability are good (visually, the fitted curve is close to the true curve).
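This experiment can be reproduced numerically; a sketch, assuming the usual sine-curve toy data behind this classic example:

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 10)
x_test = np.linspace(0.0, 1.0, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, 100)

for m in [0, 1, 3, 9]:
    w = np.polyfit(x_train, y_train, deg=m)  # fit a degree-M polynomial
    train_error = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test_error = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"M = {m}: training error {train_error:.3f}, test error {test_error:.3f}")
# Typically M = 9 drives the training error to nearly 0 while the test error
# blows up; M = 3 keeps both errors small.
```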

The relationship between training error and test error and model complexity
As model complexity increases, the training error decreases steadily, while the test error first decreases and then rises again; the appropriate complexity is the one at which the test error is smallest.

Model selection methods: regularization and cross-validation

Regularization, as we have seen, is the implementation of the structural risk minimization strategy:

  min_{f ∈ F} (1/N) Σ_{i=1}^{N} L(y_i, f(x_i)) + λ J(f)

The second term in the formula is the regularization term (or penalty term).

Cross-validation: reuse the data. Split the given data, combine the resulting subsets into training sets and test sets, and repeatedly train, test, and select models.

 Simple cross-validation: first randomly divide the data into two parts, one as the training set and the other as the test set; then train models on the training set under various conditions (for example, different numbers of parameters) to obtain different models; evaluate the test error of each model on the test set, and select the model with the smallest test error.
 S-fold cross-validation: first randomly partition the data into S disjoint subsets of the same size; then train a model on the data of S − 1 subsets and test it on the remaining subset; repeat this for every one of the S possible choices; finally, select the model with the smallest average test error over the S evaluations.
 Leave-one-out cross-validation: the special case of S-fold cross-validation with S = N, where N is the size of the given data set.
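A hand-rolled sketch of S-fold cross-validation used to choose the polynomial degree (illustrative only; libraries such as scikit-learn provide ready-made versions, but the plain loop shows the mechanics):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 40)

def s_fold_cv_error(x, y, degree, S=5):
    folds = np.array_split(rng.permutation(len(x)), S)  # S disjoint subsets
    errors = []
    for k in range(S):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != k])
        w = np.polyfit(x[train_idx], y[train_idx], deg=degree)  # train on S-1 subsets
        pred = np.polyval(w, x[test_idx])                       # test on the held-out one
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return np.mean(errors)  # average test error over the S evaluations

for m in [1, 3, 9]:
    print(f"M = {m}: average cross-validation error = {s_fold_cv_error(x, y, m):.3f}")
```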
Generative models and discriminative models

Supervised learning methods can be divided into generative approaches and discriminative approaches; the models they learn are called generative models and discriminative models, respectively. A generative method learns the joint probability distribution P(X, Y) from the data and then obtains the conditional probability distribution

  P(Y|X) = P(X, Y) / P(X)

as the prediction model; this is the generative model.

Such methods are called generative because the model represents the generative relationship by which a given input X produces an output Y. Typical generative models are the naive Bayes method and the hidden Markov model.

A discriminative method learns the decision function f(X) or the conditional probability distribution P(Y|X) directly from the data as the prediction model; this is the discriminative model. The discriminative method cares about what output Y should be predicted for a given input X. Typical discriminative models include the k-nearest-neighbor method, the perceptron, decision trees, the logistic regression model, the maximum entropy model, support vector machines, boosting methods, and conditional random fields.

Given an input x, a generative model cannot predict the output y directly: it must first compute and then compare (find the probability of each possible output and take the most probable one as the final answer). A discriminative model can give the predicted result y directly (using its decision rule or function).
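A toy sketch of the generative route with discrete variables (all numbers made up): estimate the joint distribution P(x, y) by counting, derive P(y|x) = P(x, y)/P(x), and predict the y with the largest conditional probability:

```python
import numpy as np

# Toy labeled data: x takes values 0..2, y takes values 0..1
x = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1, 1, 1])

# Generative step: estimate the joint distribution P(x, y) from counts
joint = np.zeros((3, 2))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()

# Derive P(y | x) = P(x, y) / P(x), then predict by comparing probabilities
p_x = joint.sum(axis=1, keepdims=True)
p_y_given_x = joint / p_x
for xi in range(3):
    probs = p_y_given_x[xi]
    print(f"x = {xi}: P(y|x) = {np.round(probs, 2)}, predicted y = {probs.argmax()}")
```

A discriminative method would skip the joint distribution entirely and model P(y|x) or f(x) directly.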

features of the build method:

  1, the generation method can restore the joint probability distribution P (x, y), and the Discriminant method is not,
 2, the generation method of learning convergence faster, that is, when the sample capacity increases, the learning model can converge to the real model faster,
3, when there are hidden variables, can still use the generation method to learn, The Discriminant method is not available at this time.

Features of the discriminative method:

  1. It directly learns the conditional probability P(Y|X) or the decision function f(X); facing prediction directly, it often achieves higher accuracy.
  2. Because it learns P(Y|X) or f(X) directly, the data can be abstracted to various degrees and features can be defined and used, which simplifies the learning problem.
Evaluation criteria for several models

TP (true positive): a positive instance predicted as positive (d);
FN (false negative): a positive instance predicted as negative (c);
FP (false positive): a negative instance predicted as positive (b);
TN (true negative): a negative instance predicted as negative (a).

The letters refer to the cells of the confusion matrix:

                  predicted negative    predicted positive
actual negative         a (TN)                b (FP)
actual positive         c (FN)                d (TP)

Precision: P(positive) = TP/(TP + FP) = d/(d + b)
Recall: R(positive) = TP/(TP + FN) = d/(d + c)
F1 (the harmonic mean of precision and recall):
F1(positive) = 2PR/(P + R)

Similarly, P(negative), R(negative), and F1(negative) can be obtained.
These three metrics are typically used to measure a model's ability to detect or predict each individual class.
For an overall evaluation of the model there is the accuracy AC:
AC = (a + d)/(a + b + c + d), i.e., (the diagonal elements: the number of samples predicted correctly in both the positive and negative classes)/(the total number of samples).
There are also ROC curves and other criteria.
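A sketch computing these criteria from the four counts (the counts here are made up):

```python
# Hypothetical confusion-matrix counts: d, c, b, a in the table above
TP, FN, FP, TN = 40, 10, 5, 45

precision = TP / (TP + FP)                    # P = d / (d + b)
recall = TP / (TP + FN)                       # R = d / (d + c)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)    # (a + d) / (a + b + c + d)

print(f"P = {precision:.2f}, R = {recall:.2f}, F1 = {f1:.2f}, accuracy = {accuracy:.2f}")
```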
Finally, a summary mind map of the chapter was attached; readers who do not follow it need not worry, as long as the metrics above are accurately understood.
[Mind map: Chapter 1 overview, covering the main characteristics of statistical learning, its research object, the classification of methods, the three elements (model, strategy, algorithm), the steps of a statistical learning method, supervised learning, training error and test error, over-fitting, model complexity, regularization and cross-validation, generative and discriminative models, and evaluation criteria for models.]

                                                          Life is like chess: once a piece is placed, there are no regrets.
                              ----by Ada
