Deep Learning: Stanford CS231n Course Notes

Source: Internet
Author: User
Tags: svm, keras

Objective

For deep learning novices, I recommend going through UFLDL first. If you skip the assignments, it can be read in one or two evenings. After all, convolution, pooling and the like are not particularly mysterious things. That course is concise and sharp, and points out the most basic and important ideas.
CS231n is a full course, so there is more content; although it is nominally a computer vision course, about 80% of it is deep learning. I don't need the image-specific work for now, so I skip those parts first.
I suddenly noticed that both courses are from Stanford; a top school really is a top school.

    • Course homepage: http://vision.stanford.edu/teaching/cs231n/syllabus.html
    • Baidu Cloud archive of the videos and courseware (the YouTube videos were reported and taken down, no idea why): http://pan.baidu.com/s/1pKsTivp
Lecture 1

A history lesson, mostly casual talk; treat it as listening practice, and skipping it is no big loss. The speaking pace here is still slow; when the other lecturers take over in Lecture 2, the speed suddenly ramps up and it gets bewildering.

This seems to be the only lecture where Fei-Fei Li herself appears. But she really is a big name, so her name still comes first on the slides.

Lecture 2-4

A review of machine learning; also good listening practice. I have written a blog post before that can be read alongside for comparison: [Kaggle in Practice] Digit Recognizer – from KNN, LR, SVM, RF to deep learning.

  • KNN: for more complex images, increasing k should help.
  • SVM: the hinge loss is worth a look: L_i = ∑_{j≠y_i} max(0, s_j − s_{y_i} + 1). If you don't understand it, go back and review SVMs. (See the sketch at the end of this section.)

  • The computation is decomposed into activation gates, which gives back propagation a physical interpretation. I used to think of it as nothing more than chain-rule differentiation, nothing special.

  • There is a good figure here, worth bookmarking.

After reviewing these lectures, the main gain is some geometric/visual intuition.
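As a quick illustration of the hinge loss above, here is a minimal NumPy sketch of the vectorized multiclass SVM loss; the function and variable names are my own, not necessarily the course's.

    import numpy as np

    def svm_loss(scores, y, delta=1.0):
        # scores: (N, C) class scores, y: (N,) indices of the correct classes
        N = scores.shape[0]
        correct = scores[np.arange(N), y][:, None]          # (N, 1) correct-class scores
        margins = np.maximum(0, scores - correct + delta)   # hinge margins per class
        margins[np.arange(N), y] = 0                        # the correct class contributes nothing
        return margins.sum() / N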

Lecture 5
    • A reminder that you can take advantage of pre-trained models, such as those in the Caffe Model Zoo, and then fine-tune on your own data. Looking through it, they all seem to be image models; other domains are probably too domain-specific. After all, image problems are genuinely hard, and deep learning has brought a lot of breakthroughs there; using deep learning in other areas still feels a bit mysterious to me.
Activation Functions

I had seen ReLU mentioned in the Keras documentation before and assumed it was complicated; in fact the formula is very simple, and simple is good.
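Just to spell it out, a minimal NumPy sketch of ReLU; the function name is mine.

    import numpy as np

    def relu(x):
        # ReLU is simply max(0, x), applied elementwise
        return np.maximum(0, x)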

It is important to understand the reasoning behind them:
    • Sigmoid has various problems (the saturating ends kill gradients, the outputs are not zero-centered), which is why people started improving on it.


TL;DR stands for "too long; didn't read".

Data Preprocessing

ZCA whitening and the like are covered in UFLDL.
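For reference, a minimal sketch of ZCA whitening in NumPy, assuming X is an (N, D) data matrix; the epsilon value and the names are my own choices.

    import numpy as np

    def zca_whiten(X, eps=1e-5):
        # Decorrelate features and scale them to roughly unit variance
        X = X - X.mean(axis=0)                          # zero-center each feature
        cov = X.T @ X / X.shape[0]                      # (D, D) covariance matrix
        U, S, _ = np.linalg.svd(cov)                    # eigenvectors / eigenvalues of cov
        W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA whitening matrix
        return X @ W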

Weight Initialization
    • Mostly a set of conclusions: if the weights are not initialized well, later training suffers, and the distributions of weights/activations that come out of training are poor.
    • By symmetry, if the weights are not initialized randomly, all the trained weights in a layer end up identical.
    • Biases can simply be initialized to 0.

It is fine to just use the ready-made conclusions directly. But if you are going to use an unusual activation function, you need to pay attention to the initialization problem.

For ReLU, divide by 2 inside the square root, i.e. scale the weights by sqrt(2/n) instead of sqrt(1/n).
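A minimal sketch of that rule in NumPy, assuming fan_in is the number of inputs to the layer; the function name is mine.

    import numpy as np

    def relu_init(fan_in, fan_out):
        # Gaussian init scaled by sqrt(2 / fan_in), the recommendation for ReLU layers
        return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)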

Batch Normalization

At first I didn't quite understand how this differs from preprocessing; it looked like just doing the same thing per batch. I also got distracted while listening to this part of the lecture.

Only after looking at the assignment did I realize that this is a layer inserted into the middle of the network.

However, even if we preprocess the input data, the activations at deeper layers of the network would likely no longer be decorrelated and would no longer have zero mean or unit variance, since they are outputs from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network would shift as the weights of each layer are updated.
......
To overcome this problem, [3] proposes to insert batch normalization layers into the network.

In short, it is a somewhat engineering-flavored optimization, but apparently very useful. It's from 2015, quite new, which is why UFLDL doesn't cover it.
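A minimal sketch of the training-time batch-norm forward pass, in the spirit of the assignment; the gamma, beta and eps naming is mine.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (N, D) activations coming out of the previous layer
        mu = x.mean(axis=0)                     # per-feature batch mean
        var = x.var(axis=0)                     # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
        return gamma * x_hat + beta             # learnable scale and shift

At test time, running averages of the mean and variance collected during training are used instead of the batch statistics.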

Summary

There is actually a website for this, http://lossfunctions.tumblr.com; worth a look when there is time. The loss curve is an important part of visualizing how a model is doing.

Lecture 6

The first few slides are review; worth bookmarking.

It seems that because deep learning losses are highly non-convex, the following optimizers are needed. For logistic regression I have never heard of anyone using Adam or the like; presumably SGD is good enough there.

    • SGD: the slowest.
    • Momentum: in plain terms, it seems to mean inertia.
    • Nesterov Momentum: a "lookahead" gradient, i.e. the gradient is evaluated at the position you would reach after the momentum step. The code implementation is a slightly transformed version.

    • Adagrad: large gradients get their step size suppressed.

    • RMSProp: an improvement on the above.

    • Adam: momentum + RMSProp.

L-BFGS seems to be what gets used for logistic regression.

L-BFGS does not transfer very well to mini-batch setting.

On the Code

    # Note how it's common to use alpha and 1 - alpha to trade off two variables.

    # Momentum
    v = mu * v - learning_rate * dx
    x += v

    # Nesterov Momentum
    v_prev = v
    v = mu * v - learning_rate * dx
    x += -mu * v_prev + (1 + mu) * v

    # Adagrad
    cache += dx**2
    x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)

    # RMSProp
    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)

    # Adam
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx**2)
    x += -learning_rate * m / (np.sqrt(v) + 1e-7)
Regularization: Dropout

From 2014, also quite new. Note that its effect is regularization.

    • Forces the network to have a redundant representation. This feels tailor-made for images: an object seen from a different angle, or only in part, is still recognizable to a human.
    • At test time you can either sample dropout multiple times and average (Monte Carlo approximation: do many forward passes with different dropout masks and average all the predictions), or simply not drop anything, in which case you must scale the activations, otherwise they come out larger than they were under dropout during training. In practice people generally choose not to drop at test time; the assignment practices this. (See the sketch after this list.)
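A minimal sketch of inverted dropout, where the rescaling happens at training time so the test-time pass needs no change; p here is the keep probability and the names are mine.

    import numpy as np

    def dropout_forward(x, p=0.5, train=True):
        # Inverted dropout: scale by 1/p at train time so test time is a plain pass-through
        if train:
            mask = (np.random.rand(*x.shape) < p) / p   # keep each unit with probability p
            return x * mask
        return x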
Lecture 7
    • Concepts like stride and padding in CNNs. Quite simple, but definitely worth getting straight; you will certainly run into them in the assignments. (See the sketch at the end of this section.)
    • After several conv layers, the region of the original image that a single point can "see" grows, so conv feels like abstracting features layer by layer, going from points to lines to surfaces. This explains conv well for images; I don't know how well it carries over to other domains. Then again, this course is about images.
    • Pooling: intuitively, it reduces the dimensionality of the data. Of course, images contain a lot of redundant information. Max pooling seems to be the most common; average pooling is essentially image downscaling, like generating a thumbnail from a large picture. So pooling also has a physical meaning for images.
    • ReLU: I have seen it explained elsewhere as separating signal from noise.

Finally, a screenshot: smaller filters are better computationally, which comes up later in the course.
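A tiny helper for the stride/padding arithmetic mentioned above, using the standard output-size formula (W − F + 2P) / S + 1; the function name is mine.

    def conv_output_size(W, F, S=1, P=0):
        # W: input width/height, F: filter size, S: stride, P: zero-padding on each side
        assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
        return (W - F + 2 * P) // S + 1

    # Example: a 32x32 input with a 5x5 filter, stride 1 and pad 2 stays 32x32
    # conv_output_size(32, 5, S=1, P=2) -> 32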

Lecture 8

Image-specific topics; I only skimmed them. Fast R-CNN, Faster R-CNN and so on; the names are pretty badass. YOLO is the fastest; if I have time later I'll just play with that directly.

Lecture 9

Shows how to visualize and how to fool a CNN, with some memorable images. DeepDream and the like.

Lecture 10

RNN (Recurrent Neural Networks; I'll just keep the English, since the usual translation is apparently wrong anyway) and LSTM (Long Short-Term Memory). There is nothing magical here; it is just a change to the network structure, so don't be intimidated. The min-char-rnn code is a good entry point.

RNN

Essentially it just adds the h(t−1) term; the activation function is tanh, and I have no idea why.

RNNs come in various structures; the simplest is the vanilla RNN, which is what the assignment uses.
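A minimal sketch of a single vanilla RNN step as described above; the weight names are my own, not necessarily the assignment's.

    import numpy as np

    def rnn_step(x, h_prev, Wxh, Whh, b):
        # The new hidden state mixes the current input with h(t-1), squashed by tanh
        return np.tanh(x @ Wxh + h_prev @ Whh + b)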

The image captioning example is interesting: a CNN extracts image features that become the RNN's initial hidden state h0, then the RNN's inputs are a sequence of word IDs, and the h(t) passed along the way carries both the image features and the word information. It's hard to believe this actually works. The details come out when you do the assignment yourself.

LSTM

I felt this part was not explained that well; it's better to read http://colah.github.io/posts/2015-08-Understanding-LSTMs/. If you Google LSTM now, that article ranks first; the eyes of the masses are discerning after all.

Lecture 12
    • Learned the pronunciation of Caffe, Theano, Lasagne, Keras, etc.
    • A pros/cons summary of each library. But it is better to get hands-on yourself to deepen understanding. For now I just skimmed it; after some practice I'll go through it more carefully a second time, since things you have not experienced yourself won't really sink in.

Assignment

Being lazy, I only picked a few to do; not everything is finished.

Assignment 1

Being lazy, I didn't want to do this one at first. But you need to be familiar with NumPy, otherwise the later assignments get difficult. If you are already familiar with Octave, this will not be too hard.

    • KNN: lets you feel how much faster vectorized code is, because the underlying libraries are specifically optimized for matrix computation. (See the distance sketch after this list.)
    • linear_svm: the vectorized implementation takes some thought. Derive it directly from the naive implementation, or draw out the matrices and push through it yourself; step by step it is not hard, though doing it in one leap is a bit convoluted. The hinge loss is convex, but it has points where it is not differentiable and only has a subgradient. When I learned SVMs before, they were solved via the dual with SMO; why? (A Quora answer roughly says: the dual makes it convenient to use kernels to handle data that is not linearly separable in the original space, and there are fewer parameters to optimize, independent of the data dimension.) SGD is not stable; running it several times may give different results. The course already starts instilling these practitioner instincts here, which is great.
    • linear_classifier: when randomly sampling a batch, don't forget that X and y must use the same sample indices. That one was a trap.
    • softmax: surprisingly the scores didn't need to be exponentiated and normalized again; that cost me half a day of checking. My brain turns faster now, so I implemented the vectorized version directly; the naive version felt like translating backwards, and I was too lazy to write it, so that function just calls the vectorized one.
    • two_layer_net: hyperparameter tuning.
    • features: image-specific; HOG and the color histogram are already written for you. This particular feature extraction is not directly reusable in other domains, but the color histogram is a statistical feature, and HOG seems to be one too: it accumulates statistics of gradients in each direction to capture edges. I didn't look closely, but roughly so. HOG = Histogram of Oriented Gradients; it only clicked once I looked up the full English name.
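On the KNN point, a minimal sketch of the fully vectorized pairwise L2 distance computation, using the expansion (a − b)² = a² + b² − 2ab; the names are mine.

    import numpy as np

    def pairwise_l2(X_test, X_train):
        # (N, D) and (M, D) -> (N, M) matrix of Euclidean distances, no Python loops
        test_sq = np.sum(X_test**2, axis=1, keepdims=True)   # (N, 1)
        train_sq = np.sum(X_train**2, axis=1)                 # (M,)
        cross = X_test @ X_train.T                            # (N, M)
        return np.sqrt(np.maximum(test_sq + train_sq - 2 * cross, 0))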

Watch out for division: ints may need to be cast to float. I forget exactly where I ran into this.

Assignment 2

The content itself is nothing new, and originally I didn't want to do it. But it really does get you used to the modular, layer-by-layer approach, which helps later when reading the source code of Caffe, Torch and the rest. The assignment design is quite clever. Although ReLU can be viewed as the activation function attached to a layer, it is more convenient in the implementation to treat it as a layer of its own. When I looked at Keras examples before I was a bit confused about why layers and activation functions were mixed together; that's because in UFLDL, fc + sigmoid was treated as a single layer.

The notebooks come with "unit tests" written for you, which is great: every step has a checkpoint so you know whether you got it right.

NumPy passes references by default; remember to use the xx.copy() method when you need an actual copy.
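A tiny illustration of the reference-vs-copy pitfall; the array values are just an example.

    import numpy as np

    a = np.array([1, 2, 3])
    b = a            # b is just another name for the same data
    b[0] = 99
    print(a)         # [99  2  3] -- a changed too

    c = a.copy()     # an independent copy
    c[1] = -1
    print(a)         # [99  2  3] -- a is untouched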

Assignment 3

Posting the tanh derivative for reference: d/dx tanh(x) = 1 − tanh²(x).

    • rnn_layers.py: during backprop, h(t) receives gradient not only from its own node's output but also from the t+1 node. Long story short, that cost me half a day of checking. (See the sketch below.)
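A minimal sketch of that gradient accumulation combined with the tanh derivative above; the function and argument names are mine, not the assignment's exact API.

    import numpy as np

    def tanh_backward(dout_own, dout_from_next, h):
        # The gradient reaching h(t) is the SUM of its own output gradient and the
        # gradient passed back from step t+1; then apply the local tanh derivative.
        dh = dout_own + dout_from_next
        return dh * (1.0 - h**2)   # d/dx tanh(x) = 1 - tanh(x)^2, with h = tanh(x)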
Summary

That's basically the whole course. The course design is very good and the content is very current. The relevant knowledge is organized in one place, which saves you from hunting down papers and material in a fragmented way; I recommend studying it systematically like this. The course materials are also great, don't forget to read them. As for the programming assignments, doing them honestly teaches you a lot and deepens your understanding of the lectures. I quite like the lecturer, Karpathy; his blog is worth a visit: http://karpathy.github.io/. Stanford also has CS224d, which covers deep learning for NLP; it feels like deep learning is best suited to images and NLP. Some companies now do DL for DL's sake, without knowing what they actually want to use it for.
