Machine Learning – 2nd Week


Linear Regression with Multiple Variables: Environment Setup Instructions
    • Setting Up Your Programming Assignment Environment
    • Installing Octave/MATLAB on Windows
    • Installing Octave/MATLAB on Mac OS X (10.10 Yosemite and 10.9 Mavericks)
    • Installing Octave/MATLAB on Mac OS X (10.8 Mountain Lion and Earlier)
    • Installing Octave/MATLAB on GNU/Linux
    • More Octave/MATLAB Resources

Multivariate Linear Regression: Multiple Features

In this video we will begin to introduce a new, more powerful form of linear regression, one that works with multiple variables, or multiple features.

For example, in the linear regression we learned before, we had only a single feature, the housing area x, and we wanted to use that one feature to predict the price of the house. That was our hypothesis.

But imagine that instead of just the housing area, we also know the number of bedrooms, the number of floors, and the age of the house. That gives us more information with which to predict the price.

Let's start with a brief word about notation. As we mentioned when we started, I'm going to use x1, x2, x3, x4 to denote the four features in this example, and I'll continue to use y to denote the output variable we want to predict, the price. Let's look at a bit more notation now that we have four features.

I'm going to use lowercase n to denote the number of features, so in this example n equals 4, because we have 1, 2, 3, 4, a total of four features. This n is different from the m we have been using.

We used the "m" to represent the number of samples, so if you have 47 rows, then M is the number of rows in this table, or the number of training samples.

I'm also going to use x superscript (i) to denote the input features of the i-th training example. As a concrete example, x^(2) is the feature vector of the second training example. Here x^(2) is a vector, because these four numbers are the four features of the second house in my list, the one I'm using to predict prices. In this notation the superscript (2) is an index into the training set, not an exponent: it refers to the second row of the table, my second training example. So x^(2) is a four-dimensional vector, or more generally an n-dimensional vector. Finally, I'll use x^(i) with a subscript j to denote the j-th feature of the i-th training example. Concretely, x^(2) subscript 3 is the third feature of the second training example, which in this table is equal to 2.
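
To make the notation concrete, here is a minimal Octave sketch with a small made-up training set (all numbers are hypothetical, chosen only for illustration): each row of the matrix X is one example x^(i), and X(i, j) picks out x^(i)_j, the j-th feature of the i-th example.

```octave
% Hypothetical training set: each row is one example x^(i),
% each column is one of the features x_1 ... x_4.
X = [2104 5 1 45;
     1416 3 2 40;
     1534 3 2 30;
      852 2 1 36];
y = [460; 232; 315; 178];   % hypothetical prices, in thousands of dollars

m = size(X, 1);    % number of training examples (rows)
n = size(X, 2);    % number of features (columns), here 4

x2  = X(2, :)';    % x^(2): the feature vector of the 2nd training example
x23 = X(2, 3);     % x^(2)_3: the 3rd feature of the 2nd example, here 2
```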

Now that we have more than one feature, let's talk about what the form of our hypothesis should be. Before, with x as our only feature, we used the simple one-variable hypothesis, but with multiple features that notation no longer works. Instead we change the linear regression hypothesis to h(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3 + θ4 x4. More generally, if we have n features, the hypothesis sums over all n of them rather than just four: h(x) = θ0 + θ1 x1 + ... + θn xn.

To give a concrete example, with a particular setting of the parameters we might have h(x) = θ0 + 0.1 x1 + 0.01 x2 + 3 x3 - 2 x4. Remember that the hypothesis predicts the price of a house in thousands of dollars: the 0.1 x1 term means the price goes up by $100 per square foot of area, the price keeps increasing with the number of floors x2 and with the number of bedrooms x3, and the price depreciates as the age of the house x4 increases.

This is the rewritten form of the hypothesis. Next I'm going to simplify the notation a little. For convenience, I'll define x0 = 1. Concretely, this means that for every example i there is a feature x^(i)_0 that equals 1; you can think of it as an extra "zeroth" feature that we define ourselves. So whereas I previously had n features x1, x2, ..., xn, after adding this zeroth feature my feature vector x becomes an (n+1)-dimensional vector indexed from 0. Likewise I'm going to think of my parameters as a vector: the parameters are θ0, θ1, θ2, ..., θn, and we write them all as a single (n+1)-dimensional vector θ, also indexed from 0. With this convention the hypothesis can be written h(x) = θ0 x0 + θ1 x1 + ... + θn xn, which is the same as the equation above because x0 equals 1.

Now I can write this hypothesis as θ transpose times x, depending on how familiar you are with vector inner products. If you expand θ^T x: θ^T is the row vector (θ0, θ1, ..., θn), that is, a 1 by (n+1) matrix, and x is the column vector (x0, x1, ..., xn). Their inner product θ^T x is exactly the sum above, which gives us a very convenient way to write the hypothesis: as the inner product of the parameter vector θ and the feature vector x, h(x) = θ^T x. This compact form is the hypothesis for the multi-feature case, and it has a name: multivariate linear regression. "Multivariate" is just a fancier way of saying that we predict using multiple features or variables.
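
As a minimal Octave illustration of the inner-product form (the θ values and the example x below are made up, and Xext reuses the hypothetical X from the sketch above):

```octave
theta = [80; 0.1; 0.01; 3; -2];   % hypothetical parameter vector (theta_0 ... theta_4)
x     = [1; 2104; 5; 1; 45];      % one example, with x_0 = 1 prepended

h_one = theta' * x;               % hypothesis theta' * x for a single example

Xext  = [ones(size(X, 1), 1) X];  % prepend the x_0 = 1 column to the whole training set
h_all = Xext * theta;             % hypotheses for all m examples at once
```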

Gradient Descent for Multiple Variables

In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or multiple variables. In this video we will talk about how to fit the parameters of that hypothesis, in particular how to use gradient descent to solve linear regression with multiple features.

To recap quickly: we have the multivariate linear regression hypothesis, with the convention x0 = 1, and the parameters of the model are θ0 through θn. Rather than thinking of these as n+1 separate parameters, think of θ as a single (n+1)-dimensional vector, so the parameters of the model are themselves a vector. Our cost function is J(θ0, ..., θn), the sum of squared errors: J(θ) = (1/(2m)) Σ from i=1 to m of (h_θ(x^(i)) - y^(i))².
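
Here is a minimal Octave sketch of this cost function (the name computeCostMulti is just a label used in these examples; X is assumed to already contain the x0 = 1 column of ones):

```octave
function J = computeCostMulti(X, y, theta)
  % X is m x (n+1) with a leading column of ones, y is m x 1, theta is (n+1) x 1.
  m = length(y);
  errors = X * theta - y;                    % h_theta(x^(i)) - y^(i) for every example
  J = (1 / (2 * m)) * (errors' * errors);    % sum of squared errors, divided by 2m
end
```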

And instead of thinking of J as a function of n+1 separate numbers, think of it as a function of a single (n+1)-dimensional vector θ. Here is gradient descent, which we will keep using: repeatedly update each θj to θj minus α times the corresponding derivative, that is, θj := θj - α · ∂J(θ)/∂θj, where the derivative is the partial derivative of the cost function with respect to θj. Let's look at what this means when we implement it, and in particular at that partial derivative. For the n = 1 case we had two update rules, one for θ0 and one for θ1, which should look familiar: working out the partial derivatives of the cost function gives θ0 := θ0 - α (1/m) Σ (h_θ(x^(i)) - y^(i)) and θ1 := θ1 - α (1/m) Σ (h_θ(x^(i)) - y^(i)) x^(i). The only notational difference from what follows is that when we had a single feature we simply called it x^(i); in the new notation we write x^(i) with a subscript to indicate which feature we mean.

That was the algorithm when we have only one feature. Below is the algorithm when we have more than one feature, that is, when n is greater than or equal to 1: the gradient descent update rule becomes θj := θj - α (1/m) Σ from i=1 to m of (h_θ(x^(i)) - y^(i)) x^(i)_j, applied for every j from 0 to n. If you know some calculus, you can take the partial derivative of the cost function J with respect to θj, and you will find it is exactly the term I've circled in blue. Implementing this update gives you the gradient descent algorithm for multivariate linear regression.
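
Here is a minimal Octave sketch of that update (the name gradientDescentMulti and the J_history bookkeeping are conventions for these examples; X again carries the column of ones, and computeCostMulti is the sketch above). Note that all n+1 parameters are updated simultaneously from the same old θ:

```octave
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  % X is m x (n+1) with a leading column of ones, y is m x 1, theta is (n+1) x 1.
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    grad  = (1 / m) * X' * (X * theta - y);            % partial derivatives for every theta_j
    theta = theta - alpha * grad;                      % simultaneous update of all parameters
    J_history(iter) = computeCostMulti(X, y, theta);   % record the cost after this step
  end
end
```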

Finally, I want you to see why the old and the new algorithms are really the same thing, or why they are such similar algorithms and both count as gradient descent. Consider a case with two or more features, so that we have update rules for θ0, θ1, θ2, and possibly more parameters. If you look at the rule for θ0, you'll see it is the same as in the n = 1 case: they are equivalent because, under our convention, x^(i)_0 = 1, so the two terms circled in red are the same. Similarly, if you look at the rule for θ1, you'll find it is equivalent to the old update for θ1; we are just using the new symbol x^(i)_1 to denote the first feature. Now that we have more than one feature, we apply the same kind of rule to the other parameters such as θ2 as well. This slide packs in a lot of content.

So be sure to work through it carefully, and if the math on the slide isn't clear, pause the video and make sure you understand it before moving on. If you implement this algorithm you can apply it directly to multivariate linear regression.

Gradient Descent in Practice I: Feature Scaling

In this video and the next one I want to share some practical tricks for making gradient descent work well. In this video I'll describe a method called feature scaling. The idea is this: if you have a machine learning problem with multiple features, and you can make sure those features take values in similar ranges, then gradient descent will converge more quickly. Concretely, suppose you have a problem with two features, where x1 is the size of the house, taking values between 0 and 2000, and x2 is the number of bedrooms, taking values between 1 and 5. If you plot the contours of the cost function J(θ) (J is a function of θ0, θ1 and θ2, but I'll ignore θ0 and pretend it depends only on θ1 and θ2), then because the range of x1 is so much larger than the range of x2, the contours come out as very skewed, elongated ellipses; the ratio of 2000 to 5 makes them extremely tall and thin. If you run gradient descent on a cost function shaped like this, it can take a long time: the iterates tend to oscillate back and forth and only slowly work their way to the global minimum. In fact, if the contours were even more exaggerated, even thinner and longer, things would be worse still, with gradient descent bouncing back and forth for a long time before finding its way to the global minimum.

In such cases an effective remedy is feature scaling. Concretely, define the feature x1 as the size of the house divided by 2000, and x2 as the number of bedrooms divided by 5. Then the contours of the cost function J(θ) become far less skewed and look much more like circles, and if you run gradient descent on this cost function, you can show mathematically that it takes a much more direct path to the global minimum instead of following a convoluted, wandering trajectory. So by scaling the features, that is, dividing them by their ranges, we end up in this example with two features x1 and x2 that both lie between 0 and 1, and gradient descent converges faster.

More generally, when we do feature scaling we usually aim to get every feature into approximately the range -1 to +1. The feature x0 is always equal to 1, so it is already in that range, but for the other features you may need to divide by different numbers to bring them into a comparable range. The endpoints -1 and +1 are not important in themselves: if a feature x1 takes values between 0 and 3, that's fine; if another feature takes values between -2 and +0.5, that's also close enough to -1 to +1. But if a feature x3 ranges between -100 and +100, that is very different from -1 to +1 and is likely a poorly scaled feature, and likewise if a feature x4 ranges between -0.0001 and +0.0001, a range far smaller than -1 to +1, that is also a poorly scaled feature. So the range can be somewhat larger or smaller than -1 to +1, just not too much larger or too much smaller. Different people have different rules of thumb, but mine is roughly this: anything within about -3 to +3 is acceptable, and anything within about -1/3 to +1/3, or 0 to 1/3, or -1/3 to 0, is also acceptable; but once a feature's range gets much bigger than -3 to +3, or much smaller than -1/3 to +1/3 (like x4 above), you should start thinking about feature scaling. You don't need to worry about getting the features into exactly the same range or interval; as long as they are close enough, gradient descent will run fine.

Besides dividing a feature by its maximum value, we sometimes also perform what is called mean normalization: if you have a feature xi, replace it by xi - μi so that the feature has a mean of approximately zero. Obviously we don't apply this to x0, which is always equal to 1 and so cannot have mean zero. But for the other features, for example if house sizes run from 0 to 2000 and the average size in the training set is 1000, you can set x1 to (size - 1000) / 2000; similarly, if houses have up to five bedrooms and the average house has two bedrooms, you can set x2 to (bedrooms - 2) / 5. In both cases the new features x1 and x2 end up roughly between -0.5 and +0.5. That isn't exactly right (x2 can actually be slightly larger than 0.5), but it's close. The general rule is to replace a feature x1 by (x1 - μ1) / s1, where μ1 is the mean of x1 over the training set and s1 is the range of the feature, that is, the maximum value minus the minimum value (students who know about standard deviations can also set s1 to the standard deviation of the feature, but max minus min works fine). Similarly, for the second feature x2 you can replace it by the feature minus its mean, divided by its range, where the range is again the maximum minus the minimum. Formulas of this kind put your features into roughly this kind of range. If you're being careful, using max minus min for the bedrooms example gives a range of 4, since the maximum is 5 and the minimum is 1, but whichever you use, the resulting values are very similar; all that matters is that the features end up in comparable ranges. Feature scaling does not need to be precise; it is only there to make gradient descent run faster.

Now you know what feature scaling is: with this simple method you can make gradient descent run faster and converge in fewer iterations. That was feature scaling; in the next video I'll introduce another technique for making gradient descent work well in practice.
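
A minimal Octave sketch of mean normalization (the name featureNormalize is just a convention used here; X is the feature matrix without the column of ones, which you would prepend afterwards):

```octave
function [X_norm, mu, sigma] = featureNormalize(X)
  % Subtract each feature's mean and divide by its range (max - min),
  % so every column ends up roughly within -0.5 to +0.5.
  mu     = mean(X);              % 1 x n vector of feature means
  sigma  = max(X) - min(X);      % 1 x n vector of feature ranges
  X_norm = (X - mu) ./ sigma;    % broadcasts over the m rows
end
```

As noted above, you could just as well set sigma to std(X), the per-feature standard deviation; either choice brings the features into comparable ranges.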

Gradient Descent in Practice II: Learning Rate

In this video I want to give you some practical tips about the gradient descent algorithm, focused on the learning rate α. Concretely, this is the gradient descent update rule, and I want to cover two things: how to debug, that is, how to make sure gradient descent is working correctly, and how to choose the learning rate α, which is how I usually pick this parameter. What gradient descent does is find a value of θ that you hope minimizes the cost function J(θ). What I usually do while gradient descent is running is plot the value of the cost function J(θ) against the number of iterations. Note that the x-axis here is the number of iterations of gradient descent; in the J(θ) plots we saw earlier, the horizontal axis was the parameter θ, but that is not what it means here. So after, say, 100 iterations of gradient descent I get some value of θ, and I compute J(θ) for that value; the height of this point is the value of J(θ) after 100 iterations, and this next point is the value of J(θ) after 200 iterations. The curve therefore shows the value of the cost function as gradient descent runs. If gradient descent is working correctly, J(θ) should decrease after every iteration.

One use of this curve is to tell you when you can stop. Looking at the curve I've drawn, between the 300th and 400th iteration J(θ) doesn't decrease much; by 400 iterations the curve looks nearly flat, which means gradient descent has more or less converged, because the cost function has stopped falling. So looking at this curve helps you judge whether gradient descent has converged. By the way, the number of iterations needed can vary enormously between problems: one problem may converge in 30 iterations, another may need 3,000, and another machine learning problem may need 3 million. It is hard to tell in advance how many iterations gradient descent will need, which is why we usually plot this curve of the cost function against the number of iterations and judge convergence by looking at it. Alternatively, you can run an automatic convergence test, that is, an algorithm that tells you whether gradient descent has converged. A typical example: declare convergence when the decrease in J(θ) in one iteration is smaller than some small value ε, for example 1e-3. In practice, though, choosing a suitable threshold ε is quite difficult, so to check whether gradient descent has converged I usually still look at the plot on the left rather than rely on an automatic test.
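
If you do want an automatic test, here is one minimal sketch of such a check (it reuses the J_history vector returned by the gradientDescentMulti sketch above; the threshold epsilon is something you would have to choose yourself):

```octave
% Declare convergence when the last iteration reduced the cost by less than epsilon.
epsilon   = 1e-3;
converged = (J_history(end - 1) - J_history(end)) < epsilon;
```
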
This kind of plot can also warn you in advance when the algorithm is not working properly. Specifically, if the curve of J(θ) against the number of iterations is actually rising, that is a clear sign gradient descent is not working, and it usually means you should use a smaller learning rate α. If J(θ) is rising, the most common reason is that you are minimizing a function like this one with a learning rate that is too large: starting from here, gradient descent overshoots the minimum and lands over there, then overshoots again coming back, and so on. What you really want is to descend gently to the minimum, but if the learning rate is too large, each step overshoots, the results get worse and worse, and the value of the cost function J(θ) keeps growing. So if you see a graph like this, the usual fix is to use a smaller value of α (and, of course, make sure your code has no bugs, but a too-large α is the most likely culprit). Similarly, you may sometimes see a J(θ) curve that goes down, then up, then down again, over and over; the fix for that situation is usually also to choose a smaller α. I won't prove it here, but for the linear regression we are discussing it is easy to show mathematically that as long as the learning rate is small enough, J(θ) decreases on every iteration, so if the cost function is not decreasing, try a smaller learning rate. Of course you don't want α to be too small either, because then gradient descent converges very slowly: starting from here, it creeps toward the lowest point in tiny steps and needs a great many iterations to get there. To summarize: if α is too small, you get slow convergence; if α is too large, J(θ) may not decrease on every iteration and may not converge at all (in some cases a too-large α can also cause slow convergence, but more commonly you will simply see J(θ) failing to decrease after each iteration). To debug all of these situations, plotting J(θ) against the number of iterations usually makes it clear exactly what is going on.

Concretely, when I run gradient descent I usually try a range of α values, something like 0.001, 0.01, and so on, each ten times the last, plot the J(θ) curve against the number of iterations for each, and then choose an α that makes J(θ) decrease quickly. In practice I don't actually step by factors of ten; I usually fill in intermediate values, roughly tripling each time: 0.001, then 0.003 (about three times larger), then 0.01, then 0.03, and so on. So when choosing a learning rate I try a sequence of values spaced by roughly a factor of three, until I find one value that is too small and another that is too large, and then I pick the largest α that works, or a reasonable value just slightly smaller than that maximum. Doing this usually gives me a good learning rate, and if you do the same, you should be able to find a suitable learning rate for your own gradient descent runs.
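
A minimal Octave sketch of this search over learning rates (it reuses the hypothetical gradientDescentMulti, Xext and y from the sketches above; the particular α values and iteration count are only illustrative):

```octave
alphas     = [0.001 0.003 0.01 0.03 0.1 0.3];  % candidate learning rates, roughly tripling
num_iters  = 400;
theta_init = zeros(size(Xext, 2), 1);

figure; hold on;
for k = 1:length(alphas)
  [~, J_history] = gradientDescentMulti(Xext, y, theta_init, alphas(k), num_iters);
  plot(1:num_iters, J_history);                % one J(theta) curve per learning rate
end
xlabel('Number of iterations'); ylabel('J(\theta)');
legend(arrayfun(@(a) sprintf('alpha = %g', a), alphas, 'UniformOutput', false));
hold off;
```

Whichever α makes J(θ) drop fastest without diverging is the one to keep.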

Features and Polynomial Regression

You now know multivariate linear regression. In this video I want to talk about how to choose features, and how the right choice of features can give you a more effective learning algorithm. I also want to tell you about polynomial regression, which lets you use the machinery of linear regression to fit very complicated functions, even nonlinear ones. Take predicting house prices as an example. Suppose you have two features: the frontage of the house and its depth. The frontage is the width of the front of the plot, the distance along the street, or the width of the land you own; the depth is how far back the plot goes. You might build a linear regression model in which the frontage is your first feature x1 and the depth is your second feature x2. But when applying linear regression you don't have to use x1 and x2 directly; you can create new features yourself. If I'm predicting the price of a house, what really determines the price may be the area of the land I own, so I might define a new feature x equal to the frontage times the depth, which is the area of the plot. Then I can choose a hypothesis that uses only that one feature, the land area, since the area of a rectangle is its length times its width. It depends on how you look at the problem: rather than sticking with the frontage and depth you happened to start with, defining a new feature sometimes gives you a better model.

Closely related to this idea of choosing features is what is called polynomial regression. Suppose you have a housing-price data set like this; there may be several different models you could fit to it. One option is a quadratic model, since a straight line doesn't seem to fit the data very well; a quadratic function of the size might give you a fit like this. But then you might decide a quadratic model doesn't make sense either, because a quadratic eventually comes back down, and we don't expect house prices to fall again as the size keeps growing. So you might instead choose a different polynomial model, a cubic one; a cubic function might fit the data better, like this green curve, because it doesn't eventually come back down.

So how do we actually fit such a model to the data? Using multivariate linear regression, we can do it with a very simple change of view. In the form we had before, we know how to fit h_θ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3. If we want to fit the cubic model, the green one, for predicting house prices, we use θ0 plus θ1 times the size of the house, plus θ2 times the square of the size, plus θ3 times the cube of the size. To make this match the earlier form, we simply set the first feature x1 to the size of the house, the second feature x2 to the square of the size, and the third feature x3 to the cube of the size. With those three features defined, applying ordinary linear regression fits a cubic function to my data.

One more point: if you choose features this way, feature scaling becomes much more important. If the size of the house ranges from 1 to 1,000, say 1 to 1,000 square feet, then the square of the size ranges from 1 up to 1,000,000 (that is, 1,000 squared), and the cube of the size ranges up to 10^9. These three features have wildly different ranges, so if you apply gradient descent it is very important to scale the features so that their ranges become comparable.

Finally, here is one last example of how much latitude you have in choosing features. We said a quadratic model is not ideal, because even if it fits the data nicely at first, a quadratic eventually turns back down, and we don't want to predict that prices fall as size grows. Besides building a cubic model, you have other options. One other reasonable choice might be a model in which the price is θ0 plus θ1 times the size plus θ2 times the square root of the size. The square root is a function that keeps increasing but flattens out and never comes back down, so with suitable values of θ0, θ1 and θ2 the curve rises and then levels off, matching the data. So by studying the shape of the square root function, and by understanding the shape of your data more deeply, choosing different features can sometimes give you a better model.

In this video we discussed polynomial regression, that is, how to fit a polynomial such as a quadratic or cubic function to your data, and we discussed the freedom you have in choosing features: for example, instead of using the frontage and depth of the house, you can multiply them together to get the land area. It can feel hard to decide, with so many possible features, which ones to use. In later lessons we will look at algorithms that automatically choose which features to use, so that an algorithm can look at the data and decide for you whether to fit a quadratic, a cubic, or something else. But before we learn about those algorithms, I want you to know that you have a choice of features, and that by designing different features you can fit your data with more complex functions than a straight line, including polynomial functions; sometimes, by looking at the features from the right angle, you can get a model that matches your data much better.
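
A minimal Octave sketch of this kind of feature construction (size_x and the numbers in it are hypothetical, and featureNormalize refers to the earlier sketch):

```octave
% Build polynomial features from a single column of house sizes (made-up data).
size_x = [100; 250; 800; 1000];                % house sizes in square feet

X_poly = [size_x, size_x .^ 2, size_x .^ 3];   % x1 = size, x2 = size^2, x3 = size^3

% The three columns have wildly different ranges, so scale before gradient descent.
[X_poly_norm, mu, sigma] = featureNormalize(X_poly);
X_poly_norm = [ones(size(X_poly_norm, 1), 1) X_poly_norm];   % prepend the x_0 = 1 column
```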
