Over the holiday at home, I finally finished working through the supervised learning portion of Stanford's public machine learning course (Andrew Ng), which took nearly a semester. I have not yet looked closely at the unsupervised learning and reinforcement learning parts; for now, supervised learning fits my current situation best. After finishing the supervised learning section, my understanding of machine learning is a bit deeper, and combined with the experiments I helped a senior student run earlier, I have been able to approach the subject from both an engineering and a theoretical angle, which feels very rewarding.

To consolidate what I have learned, besides working through problems, I plan to write a series of notes, deriving the more important machine learning algorithms myself and attaching my own understanding. I have always believed that only what I can explain in my own words is truly mine. Writing notes helps me clear my mind, and of course I hope it helps others too.

There are several algorithms I plan to derive carefully, to be finished before the fall semester starts in September:

1. Gradient Descent

2. Logistic Regression

3. Naive Bayes

4. SVM

5. Learning Theory

==================================

**1. Machine Learning Tasks**

The task of machine learning is fairly clear. My understanding is: use some algorithm (model) to dig out the latent patterns (features) in data, and then make reasonable guesses (predictions) about unknown data. A simple example illustrates the idea.

A homeowner in Portland, USA, wants to sell his house, but he does not know how much it is worth. So he found some past housing transaction data (living area and price) to use as a reference.

The data shows a relationship between the size of the house and its price, and it matches my intuition: the larger the house, the more it sells for. But is that really the case? If we plot the data with living area on the x-axis and price on the y-axis, the trend becomes clear.

Sure enough, the plot basically matches our expectation: house price is positively correlated with living area. The reasoning above makes sense, and in fact we have unknowingly followed a standard process:

In the diagram, the training set is the data set: the past home transactions (living area, price) that our Portland homeowner collected. The learning algorithm is the learning process; through intuitive reasoning and plotting, he arrived at the market rule "the bigger the area, the higher the price," which is the hypothesis h. Then he took his own house's area x and, according to that market rule, estimated what his house is worth, which is the final y. And just like that, a machine learning problem emerges.

Now let's look back at this example of house size and price in light of my understanding of machine learning: use some algorithm (model) to dig out the underlying patterns (features) in data, and make reasonable guesses (predictions) about unknown data. This is actually a two-step process:

Step 1: From the existing housing transaction data, learn a market pricing rule;

Step 2: Using that market pricing rule and the size of his own house, predict what his house is worth.

These two steps are the general steps of any machine learning problem. Let's keep going.

**2. Linear Regression**

But we all know that a house's price is not affected by its size alone: the number of bedrooms, location, lighting, a garage, whether it is a new house or a resale, and so on, all influence the price. Here we'll add just one simple extra factor: the number of bedrooms. As before, we start by collecting some data (living area, bedrooms, price).

Then we try to plot it to find a pattern, but the plot is no longer easy to draw: with two features it becomes three-dimensional and not very intuitive. How should we analyze such data? We can use mathematics. In the two-dimensional case (living area vs. price), the market rule we found was a positive correlation. With the new feature (number of bedrooms), intuition says the same thing: more bedrooms, higher price, again a positive correlation. So both living area and number of bedrooms should be positively related to the price, and that matches reality: a three-bedroom house sells for more than a one-bedroom apartment.

To simplify the problem, we will express this positive correlation with a linear relationship; that is, we assume the relationship between living area, number of bedrooms, and price is linear. How do we define this mathematically?

We learned in school that a straight line can be written as f(x) = ax + b. We take the same approach here and represent the price with:

h(x) = θ₀ + θ₁x₁ + θ₂x₂

You can think of θ₁x₁ + θ₂x₂ as playing the role of ax, θ₀ as b, and h(x) as the house price. x₁ is the first feature, the living area; x₂ is the second feature, the number of bedrooms. There are three parameters here: θ₀, θ₁, and θ₂. Notice that θ₀ has no x next to it; if we define x₀ = 1, the formula can be rearranged into:

h(x) = Σᵢ θᵢxᵢ = θᵀx

where θ and x are both vectors.
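As a small sketch, the hypothesis is just a dot product. The θ values below are invented for illustration, not learned from any real data:

```python
# Minimal sketch of the hypothesis h(x) = theta^T x.
# The theta values are made up for illustration, not learned from data.

def hypothesis(theta, x):
    """Dot product of the parameter vector and the feature vector."""
    return sum(t * xi for t, xi in zip(theta, x))

theta = [50.0, 0.1, 20.0]     # [theta0, price per sq ft, price per bedroom] (assumed)
x = [1.0, 2104.0, 3.0]        # [x0 = 1, living area, number of bedrooms]
price = hypothesis(theta, x)  # 50 + 0.1*2104 + 20*3
```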

As we said above, machine learning has two steps. In terms of this formula, the two steps do the following:

Step one: learn the pattern (θ) from the data (x), i.e., find the a and b in f(x) = ax + b. Suppose we learn a = 2 and b = 3, so f(x) = 2x + 3.

Step two: your house has area x = 150, so the predicted price is f(150) = 2 × 150 + 3 = 303.
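The two steps can be sketched in a few lines; here a = 2 and b = 3 are simply assumed to be the values step one learned, as in the text:

```python
# Step 1: the parameters learned from data (assumed here, per the text).
a, b = 2, 3

def f(x):
    return a * x + b

# Step 2: predict the price of a house of size 150.
prediction = f(150)  # 2*150 + 3 = 303
```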

In this example we have mainly described the two general steps of machine learning: learning a model and predicting a result. As for linear regression, my personal understanding is that it assumes the relationship between the features and the result is linear, uses this linear model to predict, and aims to make the predicted value as close as possible to the true value. The closer they are, the more accurate our predictions and the better our model.

So the question becomes: how do we make the predictions accurate? From the formula we can see that the model is really just the θ values. How to choose θ is our next task.

**3. Least Mean Squares (LMS)**

How do we define "predicting accurately" mathematically? Naturally, by the difference between the predicted value and the true value: error = |true value − predicted value|. In practice, though, we don't use the raw difference so crudely; we usually wrap it up a little, mainly for the convenience of later calculations. Here we use a least-squares form to define the error:

J(θ) = ½ Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Let's look at this formula closely. (h(x⁽ⁱ⁾) − y⁽ⁱ⁾) is the error between the predicted value and the true value of the i-th training example; squaring it removes the sign, and the factor of ½ is there to simplify the later calculus. Summing from i = 1 to m gives the error over the entire data set. We want to minimize this value (that is the purpose of least squares), which means our predictions are good on every example in the data set; in other words, our model is good.
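A minimal sketch of computing this cost on a tiny invented data set; x₀ = 1 is prepended to each example so θ₀ is handled uniformly:

```python
# LMS cost J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2, on invented data.

def hypothesis(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def cost(theta, X, y):
    return 0.5 * sum((hypothesis(theta, xi) - yi) ** 2
                     for xi, yi in zip(X, y))

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # each row: [x0 = 1, living area]
y = [3.0, 5.0, 7.0]                        # generated by y = 1 + 2*x1
perfect = cost([1.0, 2.0], X, y)  # the true parameters give zero error
off = cost([0.0, 0.0], X, y)      # a bad guess gives a large error
```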

So our goal is to choose the right θ to make J(θ) as small as possible. What method do we use? That is the main thing I want to record today: gradient descent.

**4. Gradient Descent**

Let me start with a scene that helps build intuition for gradient descent. When driving, based on the road conditions, we have a speed in mind that feels appropriate. If we are going slower than that, we step on the accelerator; the slower we are, the deeper we press. If we are a bit too fast, we brake or coast for a while to bring the speed down.

Here, if the appropriate speed is the true value and the actual speed is the predicted value, then the difference between them is the error. The larger the error, the deeper we press the accelerator, so the car reaches the speed in our heart as soon as possible. This is actually an example of gradient ascent: when the error is large, the parameter θ changes a lot, so the predicted value changes a lot and the error shrinks quickly; when the error is small, θ is only fine-tuned, nudging the predicted value closer to the true value.

That is the principle of gradient descent/ascent. The formula is:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ

Look mainly at the partial derivative part. J(θ) is the error; taking its partial derivative tells us how big the gap is between the current speed and the ideal speed in our heart, that is, the slope. α is the step size, which you can think of as the unit depth of one press on the accelerator. For example, if the step size α is 1 centimeter, each press moves the pedal 1 centimeter; if the partial derivative comes out to 10, you need to press down ten of those 1-centimeter steps. Now look at the θⱼ to the left of the sign, which represents the current pedal depth, say 5. Then 5 + 10 = 15: θ changes from 5 to 15, the pedal goes deeper, the speed rises, and the error shrinks. With that, we have updated θ once.

Did you catch the detail? The formula above is gradient descent, with a minus sign. But the scene I described is gradient ascent, with a plus sign: we press the accelerator further to raise the speed and close the gap between the current speed and my expected speed as quickly as possible.

Once the idea of the gradient is clear, we understand that gradient descent lets us update the θ values, optimizing the model and improving its accuracy. Notice that the source of each θ update is the error, i.e., the gap between the predicted value and the actual value. We are essentially using the actual values as the reference and continually correcting the model; this is called supervised learning. The supervision lies in: "I know the correct answer (the actual value); your model's current output (the predicted value) is wrong and still far off, so go back and revise." What does unsupervised learning look like, then? "There is no correct answer to begin with, and I don't know what right is; just follow your own reasoning as best you can, and whatever feels right is what you get." As for reinforcement learning, I don't know much about it yet; it is another form of learning with feedback: do well and you get a reward, do badly and you get penalized. I won't talk nonsense about what I don't yet understand.

Now that the gradient idea is clear, let's derive its several forms. The abstract formula is not very illuminating on its own, so let's substitute J(θ) into it and see what we get.

First, consider the simplest scenario, where there is only one training example (x, y). Then:

∂J(θ)/∂θⱼ = ∂/∂θⱼ [½ (h(x) − y)²] = (h(x) − y) · xⱼ

and substituting into the descent formula gives the update rule:

θⱼ := θⱼ + α (y − h(x)) xⱼ

The derivation itself is straightforward, but the final result is worth a close look: α × error × the training input xⱼ is exactly how much the θⱼ corresponding to xⱼ should change, i.e., how much deeper that pedal should go. With that, everything is known. Pretty neat, right?
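The single-example rule is short enough to sketch directly; the numbers below are invented for illustration:

```python
# One LMS update on a single example (x, y):
#   theta_j := theta_j + alpha * (y - h(x)) * x_j

def lms_step(theta, x, y, alpha):
    error = y - sum(t * xi for t, xi in zip(theta, x))  # y - h(x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]

theta = [0.0, 0.0]                             # initial parameters
theta = lms_step(theta, [1.0, 2.0], 5.0, 0.1)  # error = 5, so theta -> [0.5, 1.0]
```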

Of course our data never has just one example; there are many rows of x and y, so the update becomes:

θⱼ := θⱼ + α Σᵢ₌₁ᵐ (y⁽ⁱ⁾ − h(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾

Any questions? The xⱼ⁽ⁱ⁾ here may be the tricky part. x⁽ⁱ⁾ means the i-th row of data, and within that row there are the features x₁, x₂, …, xₙ. The j-th feature of the i-th row is xⱼ⁽ⁱ⁾. OK? Then why is it θⱼ and not θⱼ⁽ⁱ⁾? Think about what θⱼ is: θ is the weight of the features, trained from a large amount of data so that one θ satisfies all the training data as well as possible, making the overall error smallest. There is one shared θ, not one per row.

We can see there are two main factors that determine the size of each θ update: the step size α and the error. As the error gets smaller, the gradient (the partial derivative, or slope) gets smaller, so the pedal needs to move less and less, and the resulting speed changes get smaller and smaller as we approach the speed in our heart. When this happens we say the algorithm has converged: we have reached the speed we want, i.e., the smallest error, and our model is trained.

This form of gradient descent is called batch gradient descent, where "batch" means the entire data set is used for every update.
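A minimal batch gradient descent sketch on invented data; every example contributes to each update of θ, as described above. The learning rate and iteration count are arbitrary choices that happen to converge on this tiny data set:

```python
# Batch gradient descent: each update of theta uses the whole data set.

def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def batch_gradient_descent(X, y, alpha=0.05, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # One update: sum error * feature over ALL examples, per parameter.
        errors = [yi - predict(theta, xi) for xi, yi in zip(X, y)]
        theta = [t + alpha * sum(e * xi[j] for e, xi in zip(errors, X))
                 for j, t in enumerate(theta)]
    return theta

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # [x0 = 1, living area], invented
y = [3.0, 5.0, 7.0]                        # true relation: y = 1 + 2*x1
theta = batch_gradient_descent(X, y)      # should approach [1.0, 2.0]
```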

You may have noticed a problem: with BGD, we must traverse all the data just to produce one update of the θ values. This is certainly the most accurate approach, but it is inefficient; when the training set is very large, it can take a long time to produce a result. So there is an improved algorithm called stochastic gradient descent (SGD).

What does this mean? The inner loop applies the single-example (x, y) update formula, and this process is carried out m times. That is, each individual example produces a new set of θ, unlike BGD, which traverses all the data to produce one set of θ. The advantage is that if the algorithm is found to have converged, we can exit early, which saves a lot of time. The accuracy may be slightly worse, but in practice the results are fine.
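A matching sketch of stochastic gradient descent on the same kind of invented data; note that θ is updated after every single example rather than after a full pass:

```python
# Stochastic gradient descent: theta is updated once per example.

def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=1000):
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(theta, xi)
            theta = [t + alpha * error * xj for t, xj in zip(theta, xi)]
    return theta

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # [x0 = 1, feature], invented
y = [3.0, 5.0, 7.0]                         # true relation: y = 1 + 2*x1
theta = stochastic_gradient_descent(X, y)   # should approach [1.0, 2.0]
```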

OK, let's summarize. Today I took the first step. The machine learning task: use algorithms to mine patterns and make predictions. The machine learning steps: find the parameters, then make predictions. How to evaluate an algorithm: the LMS error. And how to improve a model: the gradient method.

It really does help to write things out. The accelerator example for the gradient, for instance, is my own, and I find it quite fun. Writing also corrected a lot of my own understanding: for example, I had never asked why it is θⱼ and not θⱼ⁽ⁱ⁾. Thinking it through again, ah yes, it is a weight, and learning is exactly this, learning that one set of parameters from a large amount of data. A day with a little harvest is always a happy day. Keep it up, Bread~~

Machine learning derivation notes: tasks, steps, linear regression, error, gradient descent