Week 1:
Machine learning:
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance on tasks in T, as measured by P, improves with experience E.
- Supervised learning: we already know what the correct output should look like.
- Regression: try to map input variables to some continuous function.
- Classification: try to map input variables into discrete categories.
- Unsupervised learning: we have little or no idea what the results should look like.
- Clustering: automatically group data into groups that are somehow similar or related by different variables.
- Non-clustering: find structure in a chaotic environment, as in the "Cocktail Party Algorithm".
Model representation:
- x(i): input features
- y(i): target variable
- (x(i), y(i)): a training example
- (x(i), y(i)); i = 1, ..., m: training set
- m: number of training examples
- h(x): hypothesis, hθ(x) = θ0 + θ1x
Cost Function:
- This takes an average of the squared differences between the results of the hypothesis for the inputs x and the actual outputs y.
- Algorithm: J(θ0, θ1) = (1/(2m)) * Σ_{i=1..m} (hθ(x(i)) - y(i))^2. The mean is halved as a convenience for the computation of gradient descent, because the derivative of the square function cancels out the 1/2 term.
- We use contour plots to show how the cost function is minimized.
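A minimal Python sketch of the halved mean squared error cost described above, assuming the univariate hypothesis h(x) = theta0 + theta1 * x. The function names and the toy data are illustrative and do not come from the course materials.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def compute_cost(theta0, theta1, xs, ys):
    """Halved mean of squared differences between predictions and targets."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Toy training set (assumed for illustration): y = x exactly.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
cost_perfect = compute_cost(0.0, 1.0, xs, ys)  # hypothesis matches the data
cost_flat = compute_cost(0.0, 0.0, xs, ys)     # hypothesis always predicts 0
```

A hypothesis that fits the data exactly has zero cost, and the cost grows as predictions move away from the targets.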
Gradient Descent:
- Helps us estimate the parameters in the hypothesis function.
- Algorithm (repeat until convergence): θj := θj - α * ∂/∂θj J(θ0, θ1)
- j = 0, 1: feature index number
- α: learning rate, or the size of each step. If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum.
- Partial derivative of J: direction of each step
- At each iteration, one should simultaneously update all of the parameters θj.
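Gradient descent for Linear Regression:
- Algorithm (repeat until convergence): θ0 := θ0 - α * (1/m) * Σ_{i=1..m} (hθ(x(i)) - y(i)); θ1 := θ1 - α * (1/m) * Σ_{i=1..m} (hθ(x(i)) - y(i)) * x(i)
- This method looks at every example in the entire training set on every step, and is called batch gradient descent.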
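A short Python sketch of batch gradient descent for univariate linear regression, with both parameters updated simultaneously on every step. The learning rate, iteration count, and toy data are assumptions chosen so that this small example converges.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Every step looks at the whole training set ("batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
```

On this noise-free data the parameters approach theta0 = 1 and theta1 = 2.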
Linear Algebra:
- I learned linear algebra in college, so I'll skip this part in my notes.
Week 2: Multiple Features:
- n: number of features
- x(i): input of the ith training example
- x(i)j: value of feature j in the ith training example
- hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn = θ^T x (assume x0 = 1)
Gradient descent for multiple variables:
- Algorithm (repeat until convergence): θj := θj - α * (1/m) * Σ_{i=1..m} (hθ(x(i)) - y(i)) * xj(i), updating all j = 0, ..., n simultaneously
- Feature Scaling:
- Feature scaling: dividing the input values by the range (max - min) of the input variable. Get every feature into approximately a -1 <= xi <= 1 range.
- Mean normalization: subtracting the average value of an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero.
- xi := (xi - μi) / si, where μi is the average of the values for feature i and si is the range of values (max - min), or si is the standard deviation.
- Learning rate: make a plot with the number of iterations on the x-axis and J(θ) on the y-axis. If J(θ) ever increases, then you probably need to decrease α. It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration. To choose α, try 0.001, 0.003, 0.01, ...
- Features and polynomial regression: we can improve our features and the form of our hypothesis function in a couple of different ways.
- We can combine multiple features into one. We can get a new feature x3 by taking x1 * x2.
- We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
- If you choose your features this way, then feature scaling becomes very important.
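A small Python sketch of mean normalization for a single feature, dividing either by the range (max - min) or by the standard deviation. The helper name `mean_normalize` and the sample house sizes are hypothetical.

```python
def mean_normalize(values, use_std=False):
    """Return scaled values plus the mean and scale that were used.

    Assumes the feature is not constant, otherwise the scale would be zero.
    """
    m = len(values)
    mu = sum(values) / m
    if use_std:
        # Standard deviation as the scale.
        s = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    else:
        # Range (max - min) as the scale.
        s = max(values) - min(values)
    return [(v - mu) / s for v in values], mu, s


sizes = [1000.0, 1500.0, 2000.0]  # e.g. house sizes (assumed data)
scaled, mu, s = mean_normalize(sizes)
```

The same mu and s should be reused to scale any new input before making a prediction.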
Normal equation:
- Formula: θ = (X^T X)^(-1) X^T y
- Example:
- There is no need to do feature scaling with the normal equation.
- If X^T X is non-invertible:
- Delete redundant features, such as x1 = size in feet^2 and x2 = size in m^2.
- Delete features to make sure that m > n, or use regularization.
Octave:
- GNU Octave Docs
- Vectorization can simplify the code.
Week 3: Classification:
- The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.
- x(i): feature
- y(i): label for the training example
Logistic Regression:
- We change the form of our hypothesis to satisfy 0 <= hθ(x) <= 1 by plugging θ^T x into the logistic function.
- Formula: hθ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(-z))
- Decision boundary: the line that separates the area where y = 0 from the area where y = 1. It is created by the hypothesis function (θ^T x = 0).
- Cost function: Cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0
- We can compress the two conditional cases of our cost function into one case: Cost(hθ(x), y) = -y * log(hθ(x)) - (1 - y) * log(1 - hθ(x))
- Gradient descent: this algorithm has the same form as the one we used in linear regression, but hθ(x) is now the logistic function of θ^T x.
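A minimal Python sketch of the logistic hypothesis and the compressed (single-expression) cost, averaged over the training set. It assumes each input already includes x0 = 1 and that θ^T x stays moderate in size, so `math.exp` does not overflow. The names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h(x) = g(theta^T x), interpreted as the probability that y = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def logistic_cost(theta, X, y):
    """Average of -y*log(h) - (1 - y)*log(1 - h) over all examples."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m


X = [[1.0, 0.0], [1.0, 2.0]]  # first entry of each row is x0 = 1
y = [0, 1]
cost_at_zero = logistic_cost([0.0, 0.0], X, y)
```

With all parameters at zero every prediction is 0.5, so the cost equals log(2) regardless of the labels.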
Optimization algorithms:
- Conjugate gradient
- BFGS
- L-BFGS
- We can write code that calls Octave's "fminunc()" to use these.
Multiclass classification:
- Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i. To make a prediction on a new x, pick the class i that maximizes hθ(i)(x).
Overfitting:
- Even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor.
- Options to address overfitting:
- Reduce the number of features.
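- Regularization.
- Regularized linear regression:
- Cost function: J(θ) = (1/(2m)) * [Σ_{i=1..m} (hθ(x(i)) - y(i))^2 + λ * Σ_{j=1..n} θj^2] (λ is the regularization parameter.)
- Gradient descent: θ0 := θ0 - α * (1/m) * Σ_{i=1..m} (hθ(x(i)) - y(i)) * x0(i); θj := θj * (1 - α * λ/m) - α * (1/m) * Σ_{i=1..m} (hθ(x(i)) - y(i)) * xj(i) for j = 1, ..., n
- Normal equation: θ = (X^T X + λ * L)^(-1) X^T y, where L is the (n+1) x (n+1) identity matrix with its top-left entry set to 0
- Regularized logistic regression:
- Cost function: J(θ) = -(1/m) * Σ_{i=1..m} [y(i) * log(hθ(x(i))) + (1 - y(i)) * log(1 - hθ(x(i)))] + (λ/(2m)) * Σ_{j=1..n} θj^2
- Gradient descent: the update has the same form as for regularized linear regression, with hθ(x) = g(θ^T x).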
Week 4: Neural Networks: Representation:
- If we had one hidden layer, it would look like:
- The values for each of the "Activation" nodes:
- Each layer gets its own matrix of weights Θ(j). If the network has sj units in layer j and sj+1 units in layer j+1, then Θ(j) has dimension sj+1 x (sj + 1). (The "+1" comes from the bias nodes: the output nodes do not include the bias nodes, while the inputs do.)
- Vectorized: z(j+1) = Θ(j) a(j), a(j+1) = g(z(j+1))
- We can set different theta matrices to construct fundamental logical operations by using a small neural network.
- We can construct more complex operations by using hidden layers.
- Multiclass classification: we use the one-vs-all method and let the hypothesis function return a vector of values.
Week 5: Neural Networks: Learning: Cost Function:
- L: total number of layers in the network
- sl: number of units (not counting the bias unit) in layer l
- K: number of output units/classes
Backpropagation algorithm:
- "Backpropagation" is neural-network terminology for minimizing our cost function.
- Algorithm: for t = 1 to m:
- We get
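- Unrolling parameters: we unroll all the elements of the weight matrices and put them into one long vector to pass to the optimization routine, then reshape that vector to get back the original matrices.
- Gradient checking: we can approximate the derivative with respect to θj as follows: ∂J/∂θj ≈ (J(θ1, ..., θj + ε, ..., θn) - J(θ1, ..., θj - ε, ..., θn)) / (2ε)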
- Training:
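A small Python sketch of gradient checking with a two-sided finite difference: each parameter is nudged by plus and minus epsilon, and the resulting slope is compared with an analytically derived gradient. The cost function J below is a stand-in chosen for illustration, not a neural network cost, and epsilon = 1e-4 is an assumed value.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate dJ/dtheta_j with (J(theta+eps) - J(theta-eps)) / (2*eps)."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += epsilon
        minus[j] -= epsilon
        grad.append((J(plus) - J(minus)) / (2 * epsilon))
    return grad


def J(theta):
    """Stand-in cost: theta0^2 + 3 * theta0 * theta1."""
    return theta[0] ** 2 + 3 * theta[0] * theta[1]


def analytic_gradient(theta):
    """Hand-derived gradient of the stand-in cost."""
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]


theta = [1.0, 2.0]
approx = numerical_gradient(J, theta)
exact = analytic_gradient(theta)
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
```

Because the check evaluates J twice per parameter, it is slow and is normally switched off once the analytic gradient has been verified.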
Week 6: Applying Machine Learning: Evaluating a Hypothesis:
- Set 70% of the data to be the training set and the remaining 30% to be the test set.
- In order to choose the model of your hypothesis, we can test each degree of polynomial by using a cross validation set. (60% training set, 20% cross validation set, 20% test set)
Bias vs. Variance:
- High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between the two.
- High Bias:
- High Variance:
- In order to choose the model and the regularization term λ, we need to:
- If a learning algorithm is suffering from high bias, getting more training data will not help much.
- If a learning algorithm is suffering from high variance, getting more training data is likely to help.
- A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
- A large neural network with more parameters is prone to overfitting. It is also computationally expensive.
Machine Learning System Design:
- The recommended approach:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide whether more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend in where most of the errors were made.
- It is very important to get error results as a single, numerical value.
- Precision
Handling Skewed Data:
- Skewed classes: the ratio of positive to negative examples is very close to one of the extremes.
- (y = 1 in the presence of the rare class that we want to detect)
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 score: (2 * P * R) / (P + R)
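A short Python sketch that computes precision, recall, and the F1 score from binary labels, where y = 1 marks the rare class. Returning 0.0 when a denominator is zero is a convention assumed here, and the sample labels are made up.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = rare class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

A classifier that always predicts y = 0 would score well on accuracy for skewed data, but its recall and F1 would both be 0.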
Week 7: Support Vector Machines: Optimization Objective:
- Because multiplying by a constant doesn't change the value of theta that achieves the minimum, we multiply the objective function of logistic regression by m.
- We can use either (A + λB) or (CA + B) to control the relative weighting of the two terms.
- A support vector machine just makes a prediction of y being equal to one or zero directly. So the hypothesis predicts one if θ^T x >= 0, and zero otherwise.
Large Margin Intuition:
- The SVM decision boundary would become like this:
- The black line gives the SVM robustness because it has a large margin.
Kernels:
- Given (x(i), y(i)), we choose l(i) = x(i) as landmarks, then let fi = sim(x, l(i)).
- We compute new features depending on proximity to the landmarks. So our function becomes θ0 + θ1 * f1 + θ2 * f2 + ...
- Gaussian kernel: fi = exp(-||x - l(i)||^2 / (2σ^2))
- C and Sigma:
- Do perform feature scaling before using the Gaussian kernel.
- Linear kernel: meaning no kernel.
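A minimal Python sketch of the Gaussian kernel as a similarity between an input and each landmark, producing the new features f1, f2, ... The landmark coordinates and the value sigma = 5 are assumptions for illustration.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)), between 0 and 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def kernel_features(x, landmarks, sigma):
    """One feature per landmark: close to 1 near it, close to 0 far away."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]


landmarks = [[0.0, 0.0], [3.0, 4.0]]
features = kernel_features([0.0, 0.0], landmarks, sigma=5.0)
```

Because the kernel depends on raw distances, a feature with a large numeric range would dominate the similarity, which is why feature scaling comes first.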
Week 8:
Unsupervised Learning:
Clustering:
- We give an unlabeled training set to an algorithm and ask the algorithm to find some structure in the data for us.
- K-means algorithm:
- Cost function: J = (1/m) * Σ_{i=1..m} ||x(i) - μc(i)||^2, where μc(i) is the centroid of the cluster to which x(i) is assigned
- Random initialization: randomly pick K training examples and set μ1, ..., μK equal to these K examples.
- Elbow Method:
- A better way to choose the number of clusters is to ask for what purpose you are running K-means.
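A compact Python sketch of K-means with random initialization from the training examples: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The fixed iteration count, the seed, and the toy points are assumptions, and a real run would usually try several random initializations.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def k_means(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Random initialization: pick k distinct training examples as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Cluster assignment step.
        assignments = [nearest(p, centroids) for p in points]
        # Move-centroid step (an empty cluster keeps its old centroid).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, assignments = k_means(points, k=2)
```

On these two well-separated groups the centroids settle at the mean of each group.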
Dimensionality Reduction:
- Reason: data compression, or speeding up our learning algorithm.
- Visualization: we can use dimensionality reduction to reduce data from n dimensions down to 2 or 3 dimensions, so that we can plot it and understand our data better.
Principal Component Analysis:
- PCA: find a lower dimensional surface onto which to project the data, so as to minimize the squared distance between each point and the location where it gets projected.
- Reduce from 2D to 1D: find a vector onto which to project the data to minimize the projection error.
- Reduce from nD to kD: find k vectors onto which to project the data to minimize the projection error.
- Data preprocessing: feature scaling/mean normalization
- Algorithm: compute the covariance matrix Sigma = (1/m) * Σ_{i=1..m} x(i) x(i)^T, then compute its eigenvectors with [U, S, V] = svd(Sigma).
- If we want to reduce the data from n dimensions down to k dimensions, we take the first k columns of U (n x n) as Ureduce (n x k).
- z = Ureduce' * x.
- Reconstruction from compressed representation: xapprox = Ureduce * z.
- Applying: implement PCA only if your algorithm doesn't do what you want without it.
Week 9:
Anomaly Detection:
Density Estimation:
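- We build a model of the probability p(x); if p(xtest) is less than some epsilon, then we flag the example as an anomaly.
- Gaussian distribution (normal distribution): p(x; μ, σ^2) = (1 / (sqrt(2π) * σ)) * exp(-(x - μ)^2 / (2σ^2))
- Parameter estimation: μ = (1/m) * Σ_{i=1..m} x(i), σ^2 = (1/m) * Σ_{i=1..m} (x(i) - μ)^2
- Algorithm: fit μj and σj^2 for each feature j, compute p(x) as the product over j of p(xj; μj, σj^2), and flag an anomaly if p(x) < ε.
- Evaluation: assume we have some labeled data of anomalous and non-anomalous examples. Use a training set (unlabeled, assumed to be normal examples), a cross validation set and a test set.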
- Anomaly Detection vs. supervised learning:
- Non-Gaussian features: let xnew = log(x) (logarithmic normal distribution), or xnew = x^(0.1)
- Choosing features: choose features that might take on unusually large or small values in the event of an anomaly.
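A minimal Python sketch of density-based anomaly detection: a Gaussian is fitted to each feature independently, p(x) is taken as the product of the per-feature densities, and an example is flagged when p(x) falls below epsilon. The training data and the threshold epsilon = 1e-3 are assumptions; in practice epsilon would be chosen on a cross validation set.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and variance (dividing by m) of one feature."""
    m = len(values)
    mu = sum(values) / m
    sigma2 = sum((v - mu) ** 2 for v in values) / m
    return mu, sigma2


def gaussian_density(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def fit_model(X):
    """Fit one Gaussian per feature (per column of X)."""
    return [fit_gaussian(column) for column in zip(*X)]


def probability(x, params):
    """p(x) as the product of the per-feature densities."""
    p = 1.0
    for xj, (mu, sigma2) in zip(x, params):
        p *= gaussian_density(xj, mu, sigma2)
    return p


def is_anomaly(x, params, epsilon):
    return probability(x, params) < epsilon


X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 11.0]]  # assumed normal
params = fit_model(X_train)
epsilon = 1e-3  # hypothetical threshold
typical_flag = is_anomaly([2.0, 11.0], params, epsilon)
outlier_flag = is_anomaly([8.0, 11.0], params, epsilon)
```

The example at the feature means is not flagged, while the example far from the first feature's mean is.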
Multivariate Gaussian Distribution:
Recommender Systems:
- n_u = number of users
- n_m = number of movies
- r(i, j) = 1 if user j has rated movie i
- y(i, j) = rating given by user j to movie i (defined only if r(i, j) = 1)
- θ(j) = parameter vector for user j
- x(i) = feature vector for movie i
Content Based Recommendations:
- We assume we have features for different movies.
- For each user j, learn a parameter vector θ(j). Predict that user j rates movie i with (θ(j))^T x(i) stars.
- Optimization Objective:
- Gradient Descent:
Collaborative Filtering:
- We assume that all of our users have told us how much they like romantic movies and how much they like action-packed movies.
- Optimization algorithm:
- Given x and the movie ratings, we can estimate θ.
- Given θ and the movie ratings, we can estimate x.
- Optimization Objective:
- Mean normalization: compute the average rating that each movie obtained and subtract off that mean rating. So the predicted rating of a movie becomes (θ(j))^T x(i) + the movie's average rating.
Week 10:
Large Scale Machine Learning:
Stochastic Gradient Descent:
- Randomly shuffle the data set.
- For i = 1...m: θj := θj - α * (hθ(x(i)) - y(i)) * xj(i) (for every j)
- SGD only tries to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first.
- We'll usually take 1-10 passes through the data set to get near the global minimum.
- Convergence: plot the average cost of the hypothesis over every so many training examples. We can compute and save these costs during the gradient descent iterations.
- One strategy for trying to actually converge at the global minimum is to slowly decrease α over time.
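A short Python sketch of stochastic gradient descent for univariate linear regression: the data is shuffled on each pass and the parameters are updated after every single example. The learning rate, seed, and toy data are assumptions. This tiny four-example set uses many more passes than the 1-10 that are typical for a large data set, simply because each pass here contains only four updates.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.05, passes=500, seed=0):
    """Fit h(x) = theta0 + theta1 * x, updating after each example."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    theta0, theta1 = 0.0, 0.0
    for _ in range(passes):
        rng.shuffle(data)  # randomly shuffle the data set
        for x, y in data:
            error = theta0 + theta1 * x - y
            # Update using only this one training example.
            theta0, theta1 = theta0 - alpha * error, theta1 - alpha * error * x
    return theta0, theta1


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = stochastic_gradient_descent(xs, ys)
```

On noisy data the parameters would keep wandering around the minimum, which is where decreasing the learning rate over time helps.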
Mini-batch Gradient descent:
- Use b examples with each iteration. (b = mini-batch size)
- Algorithm:
- The advantage is that we can use vectorized implementations over the b examples.
Online Learning:
- With a continuous stream of users to a website, we can run an endless loop that gets (x, y), where we collect some user actions for the features in x to predict some behavior y.
- You can update θ for each individual (x, y) pair as you collect them. This way the model can adapt to new pools of users, since it is continuously updating theta.
Map Reduce and Data Parallelism:
- Many learning algorithms can be expressed as computing sums of functions over the training set.
- We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines, so that we can train our algorithm in parallel.
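A small Python sketch of the data-parallel idea for the linear regression gradient: each chunk of the training set yields a partial sum (the "map" step, which would run on separate machines), and the partial sums are added and divided by m (the "reduce" step). Here the chunks are processed sequentially in one process, and the function names are illustrative.

```python
def partial_gradient(theta0, theta1, chunk):
    """Un-normalized gradient sums for one subset of the training data."""
    g0 = sum(theta0 + theta1 * x - y for x, y in chunk)
    g1 = sum((theta0 + theta1 * x - y) * x for x, y in chunk)
    return g0, g1


def split(data, n_chunks):
    """Split the data into roughly equal consecutive chunks."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_reduce_gradient(theta0, theta1, data, n_chunks):
    # Map: one partial sum per chunk (one chunk per machine in a real setup).
    partials = [partial_gradient(theta0, theta1, c) for c in split(data, n_chunks)]
    # Reduce: combine the partial sums and normalize by m.
    m = len(data)
    return (sum(p[0] for p in partials) / m, sum(p[1] for p in partials) / m)


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
gradient_two_chunks = map_reduce_gradient(0.0, 0.0, data, n_chunks=2)
gradient_one_chunk = map_reduce_gradient(0.0, 0.0, data, n_chunks=1)
```

Splitting the work changes where the sums are computed, not their value, so the result matches the single-machine gradient.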
Week 11:
Photo OCR:
- Text detection
- Character segmentation
- Character classification
- Use sliding windows and expansion for text detection and character segmentation.
- Ceiling analysis
Artificial Data Synthesis:
- Creating new data from scratch (using random fonts as an example)
- Taking existing labeled examples and introducing distortions to them to create extra labeled examples.
Machine Learning | Andrew Ng | Coursera Machine Learning Notes