Course introduction:

This lecture briefly reviews linear classification and linear regression, then covers logistic regression in detail: its model, its error measure, and its learning algorithm. It also examines the nonlinear transform as a way to generalize all of these linear models.

Course outline:

1. Review

2. Nonlinear transform

3. The logistic regression model

4. Error measure for logistic regression

5. Learning algorithm for logistic regression

6. Summary

1. Review

Classification of linear models:

1. Linear classification (the perceptron)

2. Linear regression

3. Logistic regression (a soft-threshold model)

4. Nonlinear transforms

We have already covered 1 and 2, and briefly touched on 4. We noted earlier that the nonlinear transform is very important but postponed the details; this lecture describes the model in full.

2. Nonlinear transform

**Definition:**

By applying a transform function Φ, data that cannot be linearly separated in the input space X can be mapped into a space Z in which it is linearly separable (in principle, any separable data can be mapped into a higher-order space where a linear separator exists).

That is:

X = {x0, x1, x2, ..., xd}  --Φ-->  Z = {z0, z1, z2, ..., zk},  where k > d

For each coordinate there is z_i = Φ_i(x).

For the perceptron model, d_vc = d + 1 in the X space, while d_vc <= k + 1 in the Z space. The inequality arises because the Z space is constructed from the X space and each z_i depends on x (z_i = Φ_i(x)), so the effective d_vc may be strictly less than k + 1.
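As a concrete illustration, here is a minimal sketch (assuming NumPy, with made-up data and a hand-picked transform): points inside the unit circle cannot be separated from points outside it by any line in X space, but after the transform Φ(x) = (1, x1^2, x2^2) a linear separator exists in Z space.

```python
import numpy as np

def phi(x):
    """A hand-picked transform Phi: map x = (x1, x2) in the X space
    to z = (1, x1^2, x2^2) in the Z space."""
    x1, x2 = x
    return np.array([1.0, x1 ** 2, x2 ** 2])

# Made-up data: points inside the unit circle are labeled +1, outside -1.
# No straight line in X space separates them, but in Z space the weight
# vector w = (1, -1, -1) gives sign(w . z) = sign(1 - x1^2 - x2^2),
# which is exactly the circular boundary.
X = np.array([[0.1, 0.2], [0.5, -0.3], [1.5, 0.0], [-1.2, 1.1]])
y = np.array([1, 1, -1, -1])

w = np.array([1.0, -1.0, -1.0])
predictions = np.array([np.sign(w @ phi(x)) for x in X])
print(predictions)  # agrees with y: linearly separable after the transform
```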

**Cost:**

Since a nonlinear transform increases the number of parameters, more data is required for training. When data is limited, this becomes a serious problem: the model is prone to overfitting.

**Traps:**

When using a nonlinear transform, we should guard against some common traps that can make the method fail. Consider two examples:

**1. Almost linearly separable data**

In this space the data is linearly separable except for two points, so any linear separator has Ein > 0. If we insist on Ein = 0, a nonlinear transform is required, and the result (transforming into a fourth-order space) does separate the data; but this transform clearly weakens the generalization ability of the model. We should therefore be careful with transforms and use them only when linear separation in the current space is impossible or very difficult, as in the second example below.

**2. Data that cannot be linearly separated**

In the two-dimensional space, the data cannot be linearly separated at all, so a linear model cannot approximate or generalize well in this space.

Assume that in this space X = {1, x1, x2}.

In the Z space, Z = {1, x1, x2, x1*x2, x1^2, x2^2}.
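The full second-order transform just described can be written as a small helper (a sketch; the function name `phi2` is my own choice):

```python
import numpy as np

def phi2(x):
    """Second-order transform from the text:
    (1, x1, x2) -> (1, x1, x2, x1*x2, x1^2, x2^2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

z = phi2((2.0, 3.0))
print(z)  # [1. 2. 3. 6. 4. 9.] -- three parameters have become six
```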

However, this raises the number of parameters from three to six, and intuitively we expect to need correspondingly more data for training. So why not use one of the following smaller spaces instead?

Z = {1, x1^2, x2^2}  // back down to three dimensions

or Z = {1, x1^2 + x2^2}  // even fewer parameters

or even z = {x1^2 + x2^2 - 0.6}  // only one dimension left...

Why are these simpler transforms not allowed?

Never forget that our goal is not to fit the in-sample data but to find a model that generalizes to data outside the sample. Using the "improved" transforms above weakens generalization, because we chose them after looking at the data: the data has contaminated the model selection. Moreover, reducing the dimension this way invalidates the VC reasoning, because the effective d_vc is no longer the small one the reduced space suggests. This work should be left to machine learning itself: through learning, the machine will tell us which parameters are zero and which are unimportant (provided the learning system is designed well enough). Choosing the model after looking at the data is called data snooping.

**Remember**: model selection should not be influenced by the data. The safest policy is not to look at the data before choosing the model.

3. The logistic regression model

In linear classification the model is h(x) = sign(w·x), where w and x are vectors and sign takes the sign of its argument, so h(x) = +1 or h(x) = -1.

In linear regression the model is h(x) = w·x. Its output is unbounded; the range depends entirely on the choice of w.

In logistic regression the chosen model is h(x) = θ(s) = e^s / (e^s + 1), where s = w·x.

It satisfies θ(-s) = 1 - θ(s) and 0 < θ(s) < 1; its graph is the familiar S-shaped curve. Why choose this model rather than another? Mainly because it has convenient properties in the error analysis, which greatly simplify the derivation.

This model is also called a soft threshold, since it expresses uncertainty. It resembles the linear regression model, but the output is squashed into (0, 1), so we can interpret the value as a probability: values near 0 mean the prediction is close to -1, and values near 1 mean it is close to +1. The model therefore conveys more information than linear classification: it tells us the likelihood of an outcome rather than a hard decision. For example, to predict whether a patient will suffer a recurrence of heart disease within a year, a hard linear classifier is inappropriate: many factors influence a recurrence, and we cannot predict with certainty whether one will occur, so a probability is the right output.
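A minimal sketch of the logistic function and the two properties quoted above (using only Python's standard `math` module):

```python
import math

def theta(s):
    """The logistic ('soft threshold') function theta(s) = e^s / (1 + e^s)."""
    return 1.0 / (1.0 + math.exp(-s))

s = 1.7  # an arbitrary signal value
print(abs(theta(s) + theta(-s) - 1.0) < 1e-12)  # theta(-s) = 1 - theta(s)
print(0.0 < theta(s) < 1.0)                     # output is a probability
```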

The input data carries binary labels, yet the learning result is a probability. This is because of noise: under its influence, the same input may lead to different outcomes.

4. Error measure for logistic regression

To build the probabilistic model, consider the following:

    P(y | x) = f(x)        for y = +1
    P(y | x) = 1 - f(x)    for y = -1

Here f(x) is the target function to be learned, and P(y | x) is the probability of observing y given x: the larger P(y | x), the more likely y is to occur.

Now h(x) is used to approximate f(x), so:

    P(y | x) = h(x)        for y = +1
    P(y | x) = 1 - h(x)    for y = -1

Since h(x) = θ(s) and θ(-s) = 1 - θ(s), the two cases collapse into one expression: P(y | x) = θ(y w·x).
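The collapse of the two-case definition into θ(y w·x) can be checked numerically (a sketch with an arbitrary signal value s):

```python
import math

def theta(s):
    """Logistic function theta(s) = e^s / (1 + e^s)."""
    return 1.0 / (1.0 + math.exp(-s))

s = 0.8                    # s = w.x for some hypothetical point
p_plus = theta(s)          # P(y = +1 | x) = h(x)
p_minus = 1.0 - theta(s)   # P(y = -1 | x) = 1 - h(x)

# Both cases are theta(y * s): theta(+s) for y = +1, theta(-s) for y = -1.
print(abs(p_minus - theta(-s)) < 1e-12)  # True
```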

For the entire dataset D = (x1, x2, ..., xN), the likelihood of the observed labels is the product ∏_n θ(y_n w·x_n). (The likelihood measures how probable the observed data is under the hypothesis h; the larger it is, the better h explains the data.)

Naturally, we want this likelihood to be as large as possible. This is the guiding direction of the learning here: find a w that maximizes the whole expression. Maximizing the likelihood effectively minimizes the discrepancy between h(x) and f(x).

The error measure follows from a chain of equivalent transformations.

Maximize the likelihood:

    ∏_n θ(y_n w·x_n)

which is equivalent to maximizing its logarithm:

    Σ_n ln θ(y_n w·x_n)

which is equivalent to minimizing:

    (1/N) Σ_n ln( 1 / θ(y_n w·x_n) )

Substituting θ(s) = e^s / (1 + e^s) gives

    Ein(w) = (1/N) Σ_n ln( 1 + e^(-y_n w·x_n) )

This error measure is called the cross-entropy error.
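The final formula can be computed directly (a NumPy sketch with hypothetical toy data):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """In-sample cross-entropy error:
    Ein(w) = (1/N) * sum_n ln(1 + exp(-y_n * (w . x_n)))."""
    margins = y * (X @ w)               # y_n * (w . x_n) for each point
    return np.mean(np.log(1.0 + np.exp(-margins)))

# Hypothetical toy data; the first column is the constant feature x0 = 1.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.0, 1.0])
print(cross_entropy_error(w, X, y))     # about 0.305
```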

Because we want an error measure that is simple and concise, and that stays consistent in form with the maximum-likelihood formulation it came from, the measure above is a reasonable choice.

With the model proposed and the error measure in place, all that remains is the learning algorithm.

5. Learning algorithm for logistic regression

To find the best w, the idea is to start from an initial value w(0) and improve it by repeated iteration until a "best" w is reached; w(1) denotes the result of one iteration from w(0). We use the gradient descent method: each iteration evaluates Ein(w(1)) = Ein(w(0) + η v), where η (eta) is the step size of each move and v is a unit vector giving the direction of the move. For simplicity, first take η to be a constant, so every step has the same length; then the only quantity left to determine is the direction v, and once v is known the next iteration can proceed. The derivation is as follows:

From the first line to the second, a first-order Taylor expansion is applied: Ein(w(0) + η v) ≈ Ein(w(0)) + η v·∇Ein(w(0)). From the second to the third, the inequality Ein(w(0)) + η v·∇Ein(w(0)) >= Ein(w(0)) - η ||∇Ein(w(0))|| holds because v is a unit vector, with equality when v points exactly opposite the gradient; dropping the second-order term is also why the relation is only approximate. Hence the best direction is v = -∇Ein(w(0)) / ||∇Ein(w(0))||.

If η is too small, each step covers only a tiny distance, and the algorithm may take impractically long to finish (N years later...). If η is too large, a step may jump past the minimum and then oscillate around it or wander off elsewhere. Only a moderate η balances running time against performance, and choosing it takes experience and luck, which is not ideal. If instead we let the step size change dynamically, growing when the gradient is large and shrinking when it is small, a good balance can be achieved:

For convenience, set η = k ||∇Ein(w(0))||, where k is a proportionality constant. The step then becomes η v = -k ∇Ein(w(0)). Renaming k as η, the update is w(1) = w(0) - η ∇Ein(w(0)). (This new η is the learning rate, a fixed constant, which is different from the earlier meaning of η as a step length.)

Next, the above process is written out as an algorithm.

**Algorithm Description:**

The algorithm can be terminated in any of the following ways:

1. Iterate until the best w is found. This gives the best result, but there is no guarantee of when, or even whether, it terminates.

2. Set a threshold and terminate when the value of Ein falls below it.

3. Limit the number of iterations and terminate when the iteration count t exceeds a fixed value.

Note also that a single run of this method finds only a local optimum, not necessarily the global one. We can, however, run several experiments, specifying different initial values each time, and take the smallest result as the output.
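Putting the pieces together, here is a fixed-learning-rate gradient descent loop for logistic regression (a sketch assuming NumPy; it combines stopping rules 2 and 3 above: a small-gradient threshold plus an iteration cap):

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of the cross-entropy error:
    grad Ein(w) = -(1/N) * sum_n (y_n * x_n) / (1 + exp(y_n * (w . x_n)))."""
    margins = y * (X @ w)
    return -np.mean((y / (1.0 + np.exp(margins)))[:, None] * X, axis=0)

def logistic_regression(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    """Fixed-learning-rate gradient descent. Stops at an iteration cap
    or when the gradient is nearly zero (a form of the Ein threshold)."""
    w = np.zeros(X.shape[1])            # the initial value w(0)
    for _ in range(max_iters):
        g = gradient(w, X, y)
        if np.linalg.norm(g) < tol:
            break
        w = w - eta * g                 # w(t+1) = w(t) - eta * grad
    return w

# Made-up 1-D data with the constant feature x0 = 1 prepended.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = logistic_regression(X, y)
probs = 1.0 / (1.0 + np.exp(-(X @ w)))  # theta(w . x) for each point
print(probs)  # low for the -1 points, high for the +1 points
```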

6. Summary

This lecture summarized the outputs of the three linear models in practical applications and the error measures they use. Different error measures can lead to completely different results, and a good error measure is a real booster for machine learning; many researchers study error measures in the hope of finding the one best suited to each specific situation. With this, the course's coverage of linear models is complete; the neural network section follows.

Caltech open course: Machine Learning and Data Mining, Linear Models II (Lesson 9)