Statistical Learning Method One: Foundation


A summary of the basic concepts and theories in statistical learning methods. Incrementally updated.

The content comes from Chapter 1 of "Statistical Learning Methods". Essentially everything in the first chapter is important, so this post is a set of reading notes combined with my own understanding.

I. What kinds of statistical learning methods are included?

Supervised learning: the data set used for learning consists of input/output pairs (labeled samples), and the learning task is to find the rule that maps inputs to outputs. Mainly used for classification, labeling (tagging), and regression analysis.

Unsupervised learning: the data set used for learning contains only inputs (unlabeled samples), and the learning task is to analyze the data and discover its structure. Mainly used for clustering.

Semi-supervised learning: a combination of supervised and unsupervised learning. It mainly considers how to use a small number of labeled samples together with a large number of unlabeled samples for training and classification. Mainly used for semi-supervised classification, semi-supervised regression, semi-supervised clustering, and semi-supervised dimensionality reduction.

Reinforcement learning: roughly speaking, the learner continuously interacts with the environment, receives rewards from it, and keeps learning from those rewards until a good strategy (policy) is reached.

II. The three elements of statistical learning

1. Model

(1) In supervised learning, the model is the conditional probability distribution or decision function to be learned.

(2) Hypothesis space: contains all possible conditional probability distributions or decision functions; it can be defined as a set of decision functions or as a family of conditional probability distributions

(3) Parameter space: contains all the parameter vectors involved in the decision functions or conditional probability models

2. Strategy

Given the hypothesis space of the model, the goal of statistical learning is to choose the optimal model from the hypothesis space; how to choose it is the question the strategy must answer.

1) Loss function and risk function

(1) Loss function (or cost function): measures the error of a single prediction by the model

For a given input X, the model produces the output f(X), but the prediction f(X) may be inconsistent with the true value Y; a loss function or cost function is used to measure the degree of prediction error.

The loss function L(Y, f(X)) is a non-negative real-valued function of the predicted value f(X) and the true value Y. The smaller the loss function, the better the model.

Common loss functions:

a) 0-1 loss function: L(Y, f(X)) = 1 if Y ≠ f(X), and 0 if Y = f(X)

b) Squared loss function: L(Y, f(X)) = (Y - f(X))^2

c) Absolute loss function: L(Y, f(X)) = |Y - f(X)|

d) Logarithmic loss function (log-likelihood loss): L(Y, P(Y|X)) = -log P(Y|X)
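
These four losses are straightforward to write down in code. A minimal sketch (my own illustration, not from the book; the function names are mine):

```python
import numpy as np

# y is the true value, f_x the model's prediction (scalars or NumPy arrays).

def zero_one_loss(y, f_x):
    return np.where(y != f_x, 1.0, 0.0)  # 1 when the prediction is wrong, else 0

def squared_loss(y, f_x):
    return (y - f_x) ** 2

def absolute_loss(y, f_x):
    return np.abs(y - f_x)

def log_loss(p_y_given_x):
    # p_y_given_x: probability the model assigns to the true label
    return -np.log(p_y_given_x)
```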

(2) Expected risk (expected loss)

The average loss of the model f(X) with respect to the joint distribution P(X, Y), a theoretical quantity: R_exp(f) = E_P[L(Y, f(X))]

(3) Empirical risk (empirical loss)

The average loss of the model f(X) over the training data set: R_emp(f) = (1/N) Σ L(y_i, f(x_i))

(4) Remarks:

Both the expected risk and the empirical risk are based on the loss function.

The expected risk is the expected loss of the model over the joint distribution: a theoretical value

The empirical risk is the average loss of the model over the training samples: it can be computed from the actual training set

According to the law of large numbers, when the sample size N tends to infinity, the empirical risk tends to the expected risk, so the empirical risk can be used to estimate the expected risk.
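
A short sketch of the empirical risk as an average over a training set (my own example, using squared loss on tiny synthetic data):

```python
import numpy as np

def empirical_risk(loss, y_true, y_pred):
    # R_emp(f) = (1/N) * sum of the per-sample losses
    return np.mean(loss(y_true, y_pred))

# Example with squared loss on a tiny synthetic training set.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(empirical_risk(lambda y, f: (y - f) ** 2, y_true, y_pred))  # 0.02
```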

2) Empirical risk minimization and structural risk minimization

(1) Empirical risk minimization (ERM): the ERM strategy takes the model with the smallest empirical risk as the optimal model; under certain conditions it is equivalent to maximum likelihood estimation

Objective: the minimum of the empirical risk function, min over f in F of R_emp(f)

When the sample size is large enough, empirical risk minimization can guarantee a good learning effect.

When the sample size is very small, empirical risk minimization may not learn well, and may even produce the "overfitting" problem

(2) Structural risk minimization (SRM): a strategy proposed to prevent overfitting; under certain conditions it is equivalent to maximum a posteriori (MAP) estimation

Structural risk adds to the empirical risk a regularization term (penalty term) that represents the complexity of the model

Objective: the minimum of (empirical risk + model complexity), min over f in F of R_emp(f) + λJ(f)
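
A minimal sketch of the structural-risk objective (my own; here the complexity term J(f) is assumed to be the squared L2 norm of the parameters):

```python
import numpy as np

def structural_risk(loss, y_true, y_pred, weights, lam=0.1):
    # R_srm = R_emp + lam * J(f); J(f) here is ||w||^2 (an assumption)
    r_emp = np.mean(loss(y_true, y_pred))
    j_f = np.sum(weights ** 2)
    return r_emp + lam * j_f
```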

3) A supervised learning problem thus becomes an optimization problem over the empirical risk or the structural risk

3. Algorithm

The first two steps focus on choosing the optimal model from the hypothesis space; this step considers how to actually solve for that optimal model.

As above, a supervised learning problem reduces to an optimization problem, and this step focuses on how to find the optimal solution.

III. Model evaluation and model selection

1. Evaluation criterion: error

Training error: the model's average loss on the training data set (an empirical risk)

Test error: the model's average loss on the test data set (an empirical risk)

2. Overfitting

As the complexity of the model increases, the training error decreases gradually and tends to 0, while the test error first decreases, reaches a minimum, and then increases. Overfitting occurs when the model complexity is too high.

The chosen model contains too many parameters (too much complexity), so it predicts the known data well but predicts unknown data poorly.
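
A quick self-contained sketch of this phenomenon (my own, not from the book): fit polynomials of increasing degree and compare training and test error; with noisy data the test error typically rises again at high degree:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)      # noisy target
x_tr, y_tr = x[::2], y[::2]                                 # training half
x_te, y_te = x[1::2], y[1::2]                               # test half

for degree in [1, 3, 9, 13]:
    coef = np.polyfit(x_tr, y_tr, degree)                   # fit the model
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)  # training error
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)  # test error
    print(f"degree={degree:2d}  train={err_tr:.3f}  test={err_te:.3f}")
```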

3. Model selection: regularization

Regularization is the implementation of the structural risk minimization strategy: it adds a regularization term (penalty term) to the empirical risk.

Regularization term: generally a monotonically increasing function of model complexity; the more complex the model, the larger the regularization term

The role of regularization is to select a model that has both small empirical risk and low complexity
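
As a concrete instance (my own sketch, not the book's example), ridge regression penalizes the squared L2 norm of the weights and has a closed-form solution:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Minimize ||y - Xw||^2 + lam * ||w||^2
    # Closed form: w = (X^T X + lam * I)^{-1} X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```

A larger lam favors simpler (smaller-norm) models at the cost of a higher empirical risk.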

4. Model selection: cross-validation

If the given sample data are sufficient, a simple way to do model selection is to randomly split the data set into three parts: a training set, a validation set, and a test set. The training set is used to train models, the validation set is used for model selection, and the test set is used for the final evaluation of the method.

However, since data are insufficient in many practical applications, cross-validation can be used to select a good model.

(1) Basic idea: reuse the data; repeatedly split the given data into a training set and a test set, and on that basis repeatedly train, test, and select models

(2) Simple cross-validation: randomly divide the data into two parts, used as the training set and the test set respectively

(3) S-fold cross-validation: first randomly split the data into S disjoint subsets; train the model on S - 1 of the subsets and test it on the remaining one; repeat this for all S possible choices; finally select the model whose average test error over the S evaluations is smallest.

(4) Leave-one-out cross-validation: the special case of S-fold cross-validation with S = N, where N is the size of the given data set
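
A minimal S-fold cross-validation sketch (my own; fit and evaluate are hypothetical callables you would supply):

```python
import numpy as np

def s_fold_cv(X, y, fit, evaluate, s=5, seed=0):
    # Split indices into s folds; train on s-1 folds, test on the held-out
    # fold; return the average test error over all s choices.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, s)
    errors = []
    for i in range(s):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(s) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(evaluate(model, X[test_idx], y[test_idx]))
    return np.mean(errors)
```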

IV. Generalization ability

The predictive ability of the model learned by a method on unknown data

1. Generalization error

In practice, the test error can be used to evaluate the generalization ability of a learning method (it is the empirical risk on the test data set), but because the test data set is limited, a theoretical analysis is used instead:

The generalization error is the error of the learned model when predicting unknown data (i.e., the expected risk)

2. Upper bound of the generalization error

It can be understood as the maximum possible value of the generalization error: the empirical risk plus a function whose arguments are the sample size and the capacity of the hypothesis space

(1) The upper bound of the generalization error is a monotonically decreasing function of the sample size; as the sample size increases, the bound tends to 0

(2) The upper bound of the generalization error is also a function of the hypothesis space capacity: the larger the hypothesis space, the harder the model is to learn, and the larger the upper bound of the generalization error
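
For a finite hypothesis space of d functions, the bound takes (to my recollection of the book's theorem; treat the exact constants as an assumption) the form R(f) ≤ R_emp(f) + ε(d, N, δ), holding with probability at least 1 - δ, where ε = sqrt((log d + log(1/δ)) / (2N)). A quick computation showing both properties above:

```python
import numpy as np

def epsilon(d, n, delta):
    # Width of a Hoeffding-style bound for a finite hypothesis space of
    # d functions: eps = sqrt((log d + log(1/delta)) / (2N)).
    return np.sqrt((np.log(d) + np.log(1.0 / delta)) / (2.0 * n))

# eps shrinks as the sample size N grows, and grows with d.
for n in [100, 1000, 10000]:
    print(n, epsilon(d=50, n=n, delta=0.05))
```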

V. Generative methods and discriminative methods

Supervised learning methods can be divided into generative methods and discriminative methods. The models they learn are called generative models and discriminative models, respectively.

1) Generative method

(1) Learn the joint probability distribution P(X, Y) from the data, then derive the conditional probability distribution P(Y|X) as the prediction model; this is the generative model: P(Y|X) = P(X, Y) / P(X)

It is called a generative method because the model represents the generative relationship by which the output Y is produced from a given input X
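
A toy illustration of that formula (mine, not the book's): estimate the joint distribution from counts and derive P(Y|X):

```python
import numpy as np

# Joint counts for a toy discrete problem: rows = X values, columns = Y values.
counts = np.array([[30, 10],   # X = 0
                   [ 5, 55]])  # X = 1
p_xy = counts / counts.sum()            # estimated joint P(X, Y)
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(X)
p_y_given_x = p_xy / p_x                # P(Y|X) = P(X, Y) / P(X)
print(p_y_given_x)                      # each row sums to 1
```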

(2) Typical generative models: naive Bayes and hidden Markov models

(3) Advantages: the joint probability distribution P(X, Y) can be recovered; learning converges faster; and when hidden variables are present, the generative method can still be used

2) Discriminative method

(1) Learn the decision function f(X) or the conditional probability distribution P(Y|X) directly from the data as the prediction model; this is the discriminative model

The discriminative method is concerned with what output Y should be predicted for a given input X.

(2) Typical discriminative models: k-nearest neighbors, perceptron, decision tree, logistic regression, maximum entropy model, support vector machine, boosting, conditional random field, etc.

(3) Advantages: higher accuracy; learning the prediction directly simplifies the learning problem

VI. What problems is supervised learning primarily used to solve?

1. Classification problems

1) Steps: a classification problem involves two processes, learning and classification.

(1) Learning process: use an effective learning method to learn a classifier from the known training data set.

(2) Classification process: use the learned classifier to classify new input instances.

2) Training set input/output type

(1) Input: continuous or discrete variable

(2) Output: finite discrete variables

3) Evaluation criteria?

Positive class: the class of interest; negative class: the other classes

TP: a positive instance predicted as positive
FN: a positive instance predicted as negative
FP: a negative instance predicted as positive
TN: a negative instance predicted as negative

Precision: P = TP / (TP + FP), i.e., correctly predicted positives / all instances predicted positive
Recall: R = TP / (TP + FN), i.e., correctly predicted positives / all actual positives
F1 value: 2/F1 = 1/P + 1/R, the harmonic mean of precision and recall; F1 is high only when both are high
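
In code (a small sketch of my own):

```python
def classification_metrics(tp, fp, fn):
    # Precision, recall, and F1 from the confusion-matrix counts above.
    precision = tp / (tp + fp)   # correct positives / predicted positives
    recall = tp / (tp + fn)      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

print(classification_metrics(tp=80, fp=20, fn=10))  # (0.8, 0.889, 0.842)
```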

4) What are the methods?

k-nearest neighbors, perceptron, naive Bayes, decision tree, decision list, logistic regression, support vector machine, boosting, Bayesian network, neural network, etc.

5) What are the applications?

Classification algorithms usually come in two forms, binary classification and multi-class classification; multi-class classification assigns an instance to one of several categories.

There are many applications: text categorization, customer type classification, etc. are all classification problems

2. Regression problems

1) Steps: a regression problem is divided into two processes, learning and prediction, and is mainly used for prediction.

(1) Learning process: learn a model, i.e. a function Y = f(X), from the training data set

(2) Prediction process: for a new input x, determine the corresponding output y according to the learned model

Learning in a regression problem is equivalent to function fitting: select a function curve that fits the known data well and predicts unknown data well.
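
A minimal function-fitting sketch (my own), using ordinary least squares for a linear model:

```python
import numpy as np

# Fit y = w*x + b by least squares on known data, then predict a new input.
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([0.1, 0.9, 2.1, 2.9])
w, b = np.polyfit(x_known, y_known, deg=1)  # degree-1 fit = linear regression
print(w * 4.0 + b)                          # prediction for a new input x = 4.0
```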

2) Training set input and output types:

Input: continuous variables

Output: continuous variables (unlike classification)

3) What are the types?

(1) According to the number of input variables, regression can be divided into simple (one-variable) regression and multiple regression

(2) According to the type of relationship between the input and output variables, it can be divided into linear regression (the function is a straight line) and nonlinear regression (the function is a curve)

3. Labeling problems

1) Steps: divided into two processes, learning and labeling; mainly used to produce the tag sequence for an observation sequence

(1) Learning process: learn a conditional probability distribution model from the training set

(2) Labeling process: for a new input observation sequence, find the corresponding output tag sequence according to the probability distribution model obtained by learning

2) Training set input and output types:

(1) Input type: an observation sequence

(2) Output type: a tag sequence or state sequence

3) Evaluation indices: labeling accuracy, precision, and recall

4) Commonly used statistical learning methods: hidden Markov models and conditional random fields

5) Applications: information extraction, natural language processing (e.g. part-of-speech tagging)
