Statistical Learning Methods Notes <Chapter 1>


Chapter 1: Overview of Statistical Learning Methods

1.1 Statistical Learning

Statistical learning is a discipline in which computers build probabilistic and statistical models from data and use them to predict and analyze data. Statistical learning is also called statistical machine learning; nowadays, "machine learning" generally refers to statistical machine learning.

The object of statistical learning is data. The basic assumption (premise) about the data is that data of the same kind possesses certain statistical regularity: for example, the features of the data can be described by random variables, and the statistical regularities of the data can be described by probability distributions.

The purpose of statistical learning is to analyze existing data, construct probabilistic and statistical models from it, and use them to analyze and predict new, unknown data, while also taking the complexity of the model and the efficiency of learning into account.

Statistical Learning methods include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

A statistical learning method consists of the hypothesis space of the model, the criterion for model selection, and the algorithm for learning the model. These are called the three elements of statistical learning: model, strategy, and algorithm.

1.2 Supervised Learning

Supervised learning is the focus of this book and the most extensively studied and most widely applied part of statistical learning.

Supervised learning can be viewed simply as feeding an input vector into a model to obtain an output vector. The input vector belongs to the input space; sometimes the input vector is mapped to a feature vector in a feature space, and sometimes the input space is assumed to be identical to the feature space.

Depending on whether the input and output variables are discrete or continuous, the prediction task is given a different name:

Input -> Output                               Prediction Task Name
Continuous -> continuous                      Regression problem
Continuous or discrete -> finite discrete     Classification problem
Discrete sequence -> discrete sequence        Tagging (annotation) problem

Supervised learning assumes that the random input and output variables X and Y follow a joint probability distribution P(X, Y), and that the training data are generated independently and identically distributed (i.i.d.) according to this joint distribution. The assumption of a joint probability distribution is supervised learning's basic assumption about the data. A supervised learning model can be a probabilistic model (represented by the conditional probability distribution P(Y | X)) or a non-probabilistic model (a decision function y = f(x)).
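A minimal sketch of the i.i.d. generation assumption: each training pair is an independent draw from one joint distribution P(X, Y). The particular distribution below (Gaussian X, logistic P(Y | X)) is my own illustrative choice, not the book's.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pair():
    """Draw one (x, y) pair from a made-up joint distribution P(X, Y):
    x ~ N(0, 1), and P(Y=1 | X=x) is logistic in x."""
    x = rng.normal()
    y = int(rng.random() < 1.0 / (1.0 + np.exp(-x)))
    return x, y

# Training data: N independent draws from the same joint distribution.
training_data = [sample_pair() for _ in range(100)]
```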

1.3 Three Elements of Statistical Learning

Method = Model + Strategy + Algorithm

1.3.1 Model: models are divided into probabilistic models and non-probabilistic models, depending on whether a joint (or conditional) probability distribution or a decision function is to be learned.

1.3.2 Strategy: the criterion by which the optimal model is learned or selected.

The difference between the predicted value and the true value is evaluated with a loss function, which measures the degree of prediction error and is written L(Y, f(X)). Common loss functions include: the 0-1 loss function (0 if the classification is correct, 1 if it is wrong); the quadratic (squared) loss function (the square of the difference between prediction and true value); the absolute loss function (the absolute value of that difference); and the logarithmic (log) loss function. The smaller the loss, the better the model (mark: does this ignore overfitting?).
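A minimal sketch of these four loss functions in Python; the function names are mine, not the book's.

```python
import math

def zero_one_loss(y, y_pred):
    """0-1 loss: 0 if the prediction is correct, 1 otherwise."""
    return 0 if y == y_pred else 1

def squared_loss(y, y_pred):
    """Quadratic (squared) loss: square of the prediction error."""
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    """Absolute loss: absolute value of the prediction error."""
    return abs(y - y_pred)

def log_loss(p_true_label):
    """Logarithmic loss: -log P(Y | X), where p_true_label is the
    probability the model assigns to the true label."""
    return -math.log(p_true_label)
```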

Empirical risk minimization (ERM) selects the model that minimizes the empirical risk, the average loss over the training data:

$$\min_{f \in \mathcal{F}} \; R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$$

When the model is a conditional probability distribution and the loss function is the logarithmic loss function, ERM is equivalent to maximum likelihood estimation (MLE).

Structural risk minimization (SRM): when the sample size is small, ERM is prone to overfitting. SRM was proposed to prevent overfitting and is equivalent to regularization (mark). SRM adds to the empirical risk a regularizer (penalty term) that represents the complexity of the model:

$$\min_{f \in \mathcal{F}} \; R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f)$$

where J(f) measures the complexity of the model and λ ≥ 0 weighs the penalty against the empirical risk.

That is, the model must achieve both low empirical risk and low model complexity. When the model is a conditional probability distribution, the loss function is the logarithmic loss, and the model complexity is expressed by the prior probability of the model, SRM is equivalent to maximum a posteriori (MAP) estimation in Bayesian estimation.
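A minimal sketch contrasting the two objectives for a linear model with squared loss and an L2 complexity penalty J(w) = ||w||²; the helper names and the default λ are my own choices.

```python
import numpy as np

def empirical_risk(w, X, y):
    """ERM objective: average squared loss over the training sample."""
    return np.mean((X @ w - y) ** 2)

def structural_risk(w, X, y, lam=0.1):
    """SRM objective: empirical risk plus a complexity penalty J(w) = ||w||^2."""
    return empirical_risk(w, X, y) + lam * np.sum(w ** 2)
```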

1.3.3 Algorithm: the concrete computational method used to learn the model (nothing to add at the moment).

1.4 Model Evaluation and Model Selection

A selected model must both fit the known data well and predict unknown data well. That is, it must minimize the empirical risk while avoiding overfitting.

(In the book's polynomial curve-fitting example, M denotes the degree of the polynomial: a small M fits the data poorly, M = 9 overfits, and M = 3 yields a good prediction model.)
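A minimal sketch of that example with NumPy, fitting polynomials of several degrees to noisy samples of a sine curve; the sample size and noise level are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy training data

x_new = np.linspace(0, 1, 100)
y_new = np.sin(2 * np.pi * x_new)  # the true curve, standing in for unknown data

for M in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=M)  # least-squares fit of a degree-M polynomial
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    pred_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"M={M}: training error {train_err:.4f}, prediction error {pred_err:.4f}")
```

With M = 9 the training error drops to nearly zero while the prediction error typically grows: that is overfitting.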

1.5 Regularization and Cross-Validation

Structural risk = empirical risk + regularization term

Regularization is the penalty term in the structural risk. Common choices are the L1 norm or the L2 norm of the parameter vector.
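A minimal sketch of the two penalty choices on a parameter vector w:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])   # a parameter vector
l1_penalty = np.sum(np.abs(w))        # L1 norm of the parameter vector
l2_penalty = np.sqrt(np.sum(w ** 2))  # L2 norm (the squared norm ||w||^2 is also common)
print(l1_penalty, l2_penalty)
```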

Regularization selects a model with both low empirical risk and low model complexity.

Regularization embodies Occam's razor (Ockham's razor): among models that explain the known data equally well, the simpler model is the better one.

Datasets are usually divided into three parts: a training set, a validation set, and a test set, used respectively for training, model selection, and model evaluation. However, when data is insufficient, splitting it three ways is clearly wasteful, so cross-validation is introduced. Cross-validation is divided into:

Simple cross-validation: divides a dataset into a training set and a test set.

S-fold cross-validation: the dataset is divided into S subsets of equal size; S-1 of the subsets are used to train the model and the remaining subset to test it; this is repeated for all S possible choices, and the best model is then selected (see the sketch after this list).

Leave-one-out cross-validation: used when data is scarce. It is the special case of S-fold cross-validation with S = N, the number of samples.
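A minimal sketch of S-fold cross-validation in Python; `fit` and `evaluate` are hypothetical stand-ins for a concrete learner and scoring function, and X, y are assumed to be NumPy arrays.

```python
import numpy as np

def s_fold_cv(X, y, fit, evaluate, S=5):
    """Split the data into S folds; for each fold, train on the other S-1 folds
    and test on the held-out one; return the average test score."""
    folds = np.array_split(np.random.permutation(len(X)), S)
    scores = []
    for i in range(S):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[test_idx], y[test_idx]))
    return np.mean(scores)
```

Leave-one-out cross-validation is then simply `s_fold_cv(X, y, fit, evaluate, S=len(X))`.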

1.6 Generalization Ability

Mark.

1.7 Generative Models and Discriminative Models (mark: this part will be detailed later)

A model learned by the generative approach is called a generative model. It learns the joint probability distribution P(X, Y) from the data and then derives the conditional probability distribution P(Y | X) as the prediction model, that is, P(Y | X) = P(X, Y) / P(X). Typical generative models include the naive Bayes model and the hidden Markov model.
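A minimal sketch of that relation, estimating P(X, Y) by counting on a toy dataset and then deriving P(Y | X); the toy data is my own.

```python
from collections import Counter

# Toy (x, y) pairs standing in for training data.
pairs = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 0)]

joint = Counter(pairs)                     # counts estimating P(X, Y)
marginal_x = Counter(x for x, _ in pairs)  # counts estimating P(X)

def p_y_given_x(y, x):
    """P(Y | X) = P(X, Y) / P(X), estimated from the counts."""
    return joint[(x, y)] / marginal_x[x]

print(p_y_given_x(0, "a"))  # 2/3
```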

A model learned by the discriminative approach is called a discriminative model. It learns the decision function f(x) or the conditional probability distribution P(Y | X) directly from the data. Typical discriminative models include the k-nearest neighbor method, the perceptron, decision trees, the logistic regression model, maximum entropy models, support vector machines, boosting methods, and conditional random fields.

When there are hidden variables, the generative approach can still be used for learning; in that case the discriminative approach cannot be used (mark: why?).

1.8 Classification Problems

A classification model is called a classifier; learning a classification model means learning a classifier.

The usual metric for evaluating classifier performance is classification accuracy, that is, the ratio of correctly classified samples to the total number of samples.

Common metrics for binary classification problems are precision and recall. The F1 score is the harmonic mean of precision and recall.
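A minimal sketch of these metrics in terms of true positives (tp), false positives (fp), and false negatives (fn); the function name is mine.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN);
    F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```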

Statistical learning methods that can be used for classification include the k-nearest neighbor method, the perceptron, naive Bayes, decision trees, decision lists, the logistic regression model, support vector machines, boosting methods, Bayesian networks, neural networks, Winnow, and so on.

1.9 Tagging Problems (mark: details deferred for now)

Tagging is a generalization of the classification problem.

1.10 Regression Problems

Regression is used to predict the relationship between input and output variables, that is, to select a mapping function between input and output. It is equivalent to function fitting: choosing a function curve that fits the known data well and predicts unknown data well.

Based on the number of input variables, regression is divided into simple regression (one input variable) and multiple regression; based on the type of model, it is divided into linear and non-linear regression.
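A minimal sketch of the simplest case, fitting a straight line by least squares; the synthetic data is my own.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # noisy linear relationship

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
print(f"fitted: y = {slope:.2f} * x + {intercept:.2f}")
```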

 

Summary: these are notes on the overview chapter of statistical learning, aimed at understanding the ideas rather than the underlying statistics in depth. Since the later chapters study the methods in detail, a first pass like this is worthwhile; some points are still not entirely clear to me, since I have only skimmed them. I expect that as I learn the individual statistical methods, I will gradually look back at these concepts and understand them more deeply.

Statistical learning methods introduced in this book: the perceptron, the k-nearest neighbor method, naive Bayes, decision trees, logistic regression and the maximum entropy model, support vector machines, boosting methods, the EM algorithm, the hidden Markov model, and conditional random fields.

 

 
