"PRML Reading notes-chapter1-introduction" 1.5 decision theory

Source: Internet
Author: User

First impressions:

Probability theory gives us a unified framework for quantifying and manipulating uncertainty, which in practice means computing many probabilities. How to make good decisions based on these computed probabilities is the job of decision theory.

An example:

The book gives the following example. Given an X-ray image x, the goal is to determine whether the patient has cancer (class C1) or does not (class C2). Treating this as a two-class classification problem, Bayes' theorem gives:

p(C_k | x) = p(x | C_k) p(C_k) / p(x)

Here p(C_k) is the prior probability (if C1 denotes having the disease, then p(C1) is the probability that a person has the disease before any image is seen), and p(C_k | x) is the posterior probability after observing the X-ray.

Suppose our goal is: given an x, minimize the probability of misclassification. Intuition tells us that we should choose the class with the larger posterior probability as our final decision. This is decision theory at work.
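As a minimal sketch of this rule (the likelihood and prior values below are made up for illustration; they are not from the book):

```python
# Hypothetical numbers: p(x|C1), p(x|C2) for one observed X-ray x,
# plus the priors p(C1), p(C2). All values are illustrative.
def bayes_decide(likelihoods, priors):
    # Unnormalized posteriors: p(Ck|x) is proportional to p(x|Ck) p(Ck);
    # the evidence p(x) is a common factor, so the argmax ignores it.
    scores = [lk * pr for lk, pr in zip(likelihoods, priors)]
    evidence = sum(scores)
    posteriors = [s / evidence for s in scores]
    return posteriors.index(max(posteriors)), posteriors

# Here the likelihood favors C1, but the prior strongly favors C2,
# and the posterior decides in favor of C2 (index 1):
k, post = bayes_decide(likelihoods=[0.2, 0.05], priors=[0.01, 0.99])
```

The design point is that only the product p(x | C_k) p(C_k) matters for the decision; dividing by the evidence just rescales both scores.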

Next, let us see why this intuition is correct.

Minimizing the misclassification rate (1.5.1)

Assuming our goal is simply to minimize the probability of misclassification, we need a rule for the two-class problem: divide the whole input space into two regions, one per class, and assign a new input to whichever class owns the region it falls in. These regions are called decision regions (the sub-regions belonging to the same class need not be connected), and the boundaries between them are called decision boundaries.

To find the optimal rule, write out the probability of misclassification explicitly:

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

where R_k denotes the decision region for class k and C_k denotes the actual class.

For a given x, if we decide that x belongs to C1, the contribution to p(mistake) is p(x, C2); if we decide C2, the contribution is p(x, C1). So to minimize the error we should assign x to C1 whenever p(x, C1) > p(x, C2).

And because p(x, C_k) = p(C_k | x) p(x), where the factor p(x) is the same for both classes, the problem of minimizing the probability of misclassification turns into the problem of choosing the class with the larger posterior probability p(C_k | x). As shown in the following:

Description (of the book's figure, not reproduced here): take a point x̂ as the decision boundary; then

for x > x̂ the decision is class C2, and for x < x̂ the decision is class C1.

Look mainly at the red area. For x0 < x < x̂, we clearly have p(x, C2) > p(x, C1), yet because of where the boundary sits the decision is still C1, so the error probability is inflated. To make the area of the red region (the avoidable part of the error) as small as possible, i.e. to achieve the smallest probability of misclassification, we should move the decision boundary to x0, the point where the two joint densities cross; then the red area shrinks to zero.
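The claim that the crossing point minimizes the error can be checked numerically. A small sketch, assuming two Gaussian class-conditional densities with equal priors and equal variances (my choice of parameters, not the book's); by symmetry the densities cross at the midpoint of the means:

```python
import math

def norm_cdf(x, mu, sigma):
    # Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def misclass_prob(boundary, mu1=0.0, mu2=2.0, sigma=1.0, p1=0.5, p2=0.5):
    # Decide C1 for x < boundary, C2 for x >= boundary (mu1 < mu2).
    # p(mistake) = integral over R1 of p(x,C2) + integral over R2 of p(x,C1)
    err_in_R1 = p2 * norm_cdf(boundary, mu2, sigma)          # true C2, decided C1
    err_in_R2 = p1 * (1.0 - norm_cdf(boundary, mu1, sigma))  # true C1, decided C2
    return err_in_R1 + err_in_R2

# The crossing point of the two joint densities (here x0 = 1.0 by symmetry)
# yields a lower error than any shifted boundary:
x0 = 1.0
```

Evaluating misclass_prob at x0 and at boundaries shifted to either side confirms that x0 is the minimum.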

For multi-class problems, it is slightly easier to work with the probability of being correct instead, p(correct) = Σ_k ∫_{R_k} p(x, C_k) dx, and maximize it.

Minimizing the expected loss (1.5.2)

Minimizing the probability of misclassification is a fairly simple strategy; in practice we often face more complex situations that call for other strategies.

Take the illness example again. If a healthy person is diagnosed as ill, a closer examination will reveal the mistake, and little harm is done. But if a patient who is actually ill is diagnosed as healthy, the misdiagnosis may well cost them the best window for treatment, ultimately leading to serious consequences. Both are misdiagnoses, yet the prices paid are very different (the cost of the latter is clearly far higher than the former).

Therefore, to capture this situation, the concept of the loss function is introduced. Further, we introduce the loss matrix, whose element L_{kj} represents the loss incurred when a sample's true class is C_k and our decision is C_j.

In the example, a cancer patient misdiagnosed as healthy incurs a loss of 1000, while a healthy person misdiagnosed with cancer incurs a loss of 1 (correct decisions cost 0).
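Using this loss matrix, one can weight the posteriors by the costs and pick the cheaper decision. A minimal sketch (the posterior values in the usage line are made up for illustration):

```python
# Rows: true class k (0 = cancer, 1 = healthy);
# columns: decision j (0 = cancer, 1 = healthy).
LOSS = [[0, 1000],   # true cancer: deciding "healthy" costs 1000
        [1,    0]]   # true healthy: deciding "cancer" costs 1

def min_risk_decision(posteriors, loss=LOSS):
    # Expected loss of decision j: sum over k of L[k][j] * p(Ck|x);
    # pick the decision that minimizes this sum.
    risks = [sum(loss[k][j] * posteriors[k] for k in range(len(posteriors)))
             for j in range(len(loss[0]))]
    return risks.index(min(risks)), risks

# Even with only a 1% posterior probability of cancer, the huge cost of
# missing it makes "cancer" (decision 0) the minimum-risk choice:
decision, risks = min_risk_decision([0.01, 0.99])
```

Only when the posterior for cancer becomes tiny (below roughly 1/1000 here) does "healthy" become the cheaper decision, which is exactly the asymmetry the loss matrix is meant to encode.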

So, the goal of our optimization this time is to minimize the expected loss:

E[L] = Σ_k Σ_j ∫_{R_j} L_{kj} p(x, C_k) dx

We can rewrite this using p(x, C_k) = p(C_k | x) p(x). Because the factor p(x) is the same in every term, the problem of minimizing the expected loss translates into a pointwise rule: for each x, choose the decision region (class) j that minimizes

Σ_k L_{kj} p(C_k | x)

The reject option (threshold value)

Given an input x, if the largest posterior probability max_k p(C_k | x) < θ, it means there is some ambiguity between the classes for this x. In this case, we stop the machine from making a judgment and defer to a manual decision.
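A minimal sketch of the reject rule (the threshold value and the posterior vectors are illustrative assumptions):

```python
def classify_with_reject(posteriors, theta=0.9):
    # If the largest posterior falls below theta, refuse to decide
    # automatically and hand the case to a human expert.
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if posteriors[best] < theta:
        return None          # reject: ambiguous case
    return best              # confident: return the class index

classify_with_reject([0.55, 0.45])   # ambiguous, rejected
classify_with_reject([0.97, 0.03])   # confident, class 0
```

Setting theta = 1 rejects every input, while theta <= 1/K (for K classes) rejects nothing, so theta directly trades off error rate against rejection rate.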

Inference and decision making

So far, we have divided the classification problem into two stages:

1. Inference stage: use the training set to build a model of the posterior probabilities p(C_k | x);

2. Decision stage: use these posterior probabilities to obtain the optimal decision (the classification result).

In fact, there are three quite different approaches to solving decision problems (complexity from high to low):

1. Generative models: model the input and output data jointly, so we can even generate new input points from the model.

   a. First, for each class C_k, model the class-conditional density p(x | C_k) and the prior p(C_k);

   b. Then compute the posterior probability by Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x).

   Features: more laborious, since it involves the joint distribution of x and C_k; but we gain extra information. For example, the marginal p(x) = Σ_k p(x | C_k) p(C_k) can be obtained by normalization, which tells us how likely a test point is to be an outlier (novelty/noise detection).

2. Discriminative models:

   a. Model the posterior p(C_k | x) directly;

   b. Use it to assign a class to each input x.

3. Discriminant functions:

   A discriminant function is a mapping that takes an input x and outputs a label directly.

   Features: direct; no posterior probabilities are computed at all.

Although the discriminant function is the most direct route and never computes posterior probabilities, computing the posterior is still very valuable:

1. Minimizing risk: if the loss matrix changes frequently,

   with a posterior model: just modify the minimum-risk decision criterion;

   with a discriminant function: go back to the training data and retrain from scratch.

2. Reject option (threshold value): the posterior gives a natural confidence measure for deciding when to withhold an automatic decision.

3. Compensating for class priors:

   If the classes are extremely imbalanced, the model's accuracy degrades, affecting the final result, so we want the training classes to be as balanced as possible. We can therefore train on an artificially balanced data set, take the posterior probabilities it produces, divide by the class fractions of that balanced set, multiply by the class fractions of the real population, and renormalize. (As the book puts it: "We can therefore simply take the posterior probabilities obtained from our artificially balanced data set and first divide by the class fractions in that data set and then multiply by the class fractions in the population to which we wish to apply the model.")
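The prior-rebalancing recipe quoted above can be sketched as follows (all numbers are hypothetical):

```python
def rebalance_posterior(balanced_post, balanced_priors, true_priors):
    # Divide out the class fractions of the balanced training set,
    # multiply in the class fractions of the real population,
    # then renormalize so the result sums to one.
    unnorm = [p / bp * tp
              for p, bp, tp in zip(balanced_post, balanced_priors, true_priors)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Posterior from a model trained on a 50/50 balanced set, corrected for
# a population where class 0 is actually rare (1%):
adjusted = rebalance_posterior([0.8, 0.2], [0.5, 0.5], [0.01, 0.99])
```

Note how a confident 0.8 posterior for the rare class is pulled down sharply once the true 1% prior is taken into account.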

4. Combining models:

   For complex applications, decompose the large problem into independent small problems. Example: suppose the disease is related to both the X-ray image x_I and blood test data x_B. We can then model x_I and x_B separately; if they are conditionally independent given the class, p(x_I, x_B | C_k) = p(x_I | C_k) p(x_B | C_k), and the combined posterior satisfies p(C_k | x_I, x_B) ∝ p(C_k | x_I) p(C_k | x_B) / p(C_k).

There is a good example of the difference between a discriminative model and a generative model:

Suppose we have a classification problem where x is the feature and y is the class label. A generative model learns the joint probability distribution P(x, y); a discriminative model learns the conditional probability distribution P(y | x).
A simple example illustrates the point. Suppose x takes two values (1 or 2), y has two classes (0 or 1), and the samples are (1,0), (1,0), (1,1), (2,1).
The joint distribution learned (generative model) is:

         y=0    y=1
  x=1    1/2    1/4
  x=2     0     1/4

and the conditional distribution learned (discriminative model) is:

         y=0    y=1
  x=1    2/3    1/3
  x=2     0      1

In an actual classification problem, the discriminative model can be used directly to judge the class of a feature, while the generative model must first be converted via Bayes' rule before being applied to classification. On the other hand, the joint distribution of the generative model has other uses, which means the generative model is more general and versatile; the discriminative model is more direct and simpler.
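A small sketch that reproduces both tables from the four samples (using the same toy data as above):

```python
from collections import Counter

samples = [(1, 0), (1, 0), (1, 1), (2, 1)]   # (x, y) pairs from the example

counts = Counter(samples)   # missing pairs count as 0
n = len(samples)

# Generative view: joint distribution P(x, y)
joint = {(x, y): counts[(x, y)] / n for x in (1, 2) for y in (0, 1)}

# Discriminative view: conditional distribution P(y | x),
# obtained by normalizing each row of counts by the count of that x.
cond = {}
for x in (1, 2):
    nx = sum(counts[(x, y)] for y in (0, 1))
    for y in (0, 1):
        cond[(x, y)] = counts[(x, y)] / nx
```

Running this yields joint[(1,0)] = 1/2 and cond[(1,0)] = 2/3, matching the two tables.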

The loss function in regression

We have been talking about classification problems; now we turn to regression.

If the difference between classification and regression is not clear, this description may help: {

The difference between classification and regression is the type of the output variable.

Quantitative (continuous) output is called regression, or continuous-variable prediction;
qualitative (discrete) output is called classification, or discrete-variable prediction.

For example:
predicting how many degrees tomorrow's temperature will be is a regression task;
predicting whether tomorrow will be cloudy, sunny, or rainy is a classification task.
}

Decision stage: select an appropriate mapping y(x) to predict the target t.

If we take the squared difference between the predicted result and the actual result as the loss function:

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt

Our goal is of course to minimize this loss, i.e. minimize E[L]. Taking the (functional) derivative with respect to y(x) and setting it to zero gives the minimizer:

y(x) = ∫ t p(t | x) dt = E[t | x]

That is, the optimal prediction is the conditional mean of t given x. (In the book's figure, not reproduced here, the conditional distribution p(t | x0) at x = x0 is drawn as the blue curve, and y(x0) is its mean.)

In addition, there is a different way to reach this result. The squared term can be expanded as:

{y(x) − t}² = {y(x) − E[t | x]}² + 2{y(x) − E[t | x]}{E[t | x] − t} + {E[t | x] − t}²

Substituting into E[L], the cross term vanishes under the expectation over t, leaving:

E[L] = ∫ {y(x) − E[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx

To minimize this loss, the first term on the right reaches its minimum, zero, when y(x) = E[t | x]; the second term expresses the fluctuation (variance) of the target t around its conditional mean. Because it depends only on the joint distribution p(x, t) and not on y(x), it represents the portion of the loss that cannot be reduced.
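That the conditional mean minimizes the average squared loss can be illustrated numerically (the target values below are made up; think of them as samples of t for a single x):

```python
def avg_sq_loss(y, ts):
    # Empirical version of E[{y - t}^2] for a fixed prediction y.
    return sum((y - t) ** 2 for t in ts) / len(ts)

ts = [1.0, 2.0, 2.0, 7.0]        # hypothetical targets observed at one x
mean_t = sum(ts) / len(ts)       # empirical E[t | x] = 3.0

# The conditional mean attains a lower average squared loss than any
# other candidate prediction:
losses = {y: avg_sq_loss(y, ts) for y in (1.0, 2.0, mean_t, 4.0)}
```

Comparing losses[mean_t] with the loss of any other candidate confirms the mean is the minimizer, mirroring the derivation above.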

Similar to the classification case, regression problems can be approached in three different ways (complexity from high to low):

1. Infer the joint distribution p(x, t), normalize to obtain the conditional density p(t | x), and finally compute the conditional mean;

2. Infer the conditional density p(t | x) directly, then compute the conditional mean;

3. Find a mapping y(x) directly from the training data.

Of course, the squared difference is just one possible loss function, and in many situations it is not the appropriate choice. So we need other error functions, for example the Minkowski loss:

E[L_q] = ∫∫ |y(x) − t|^q p(x, t) dx dt

This is a more general form: q = 2 recovers the squared loss, whose minimum is the conditional mean; q = 1 is minimized by the conditional median; and as q → 0 the minimum approaches the conditional mode.
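A small numerical check (a grid search over made-up targets, same toy data idea as before) that q = 2 leads to the mean and q = 1 to the median:

```python
def minkowski_loss(y, ts, q):
    # Empirical version of E[|y - t|^q] for a fixed prediction y.
    return sum(abs(y - t) ** q for t in ts) / len(ts)

ts = [1.0, 2.0, 2.0, 7.0]                      # mean 3.0, median 2.0
candidates = [i / 10 for i in range(0, 101)]   # grid on [0, 10], step 0.1

best_q2 = min(candidates, key=lambda y: minkowski_loss(y, ts, 2))
best_q1 = min(candidates, key=lambda y: minkowski_loss(y, ts, 1))
```

The grid minimizer for q = 2 lands on the mean (3.0) and for q = 1 on the median (2.0), showing how the choice of q changes which summary of p(t | x) the optimal predictor tracks.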


"PRML Reading notes-chapter1-introduction" 1.5 decision theory

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.