I finally passed all the Deeplearning.ai courses in March! I highly recommend them!
If you already know the basics, you may be interested in Courses 4 & 5, which show many interesting cases in CNN and RNN. That said, I do think Courses 1 & 2 are better structured than the others, and they gave me more insight into NN.
I have uploaded the assignments of all the deep learning courses to my GitHub. You can find the assignments for CNN here. Hopefully they can give you some help when you struggle with the grader. For a new course, you do need more patience to fight with the grader. Don't ask me how I know this... >_<
I have finished the summary of the first course in my previous posts:
- Sigmoid and Shallow NN
- Forward & Backward Propagation
- Regularization
I'll keep working on the others. Since I have been using CNN at work recently, let's go through CNN first. Any feedback is absolutely welcome! And please correct me if I make any mistakes.
When talking about CNN, image applications are usually what comes to mind first. Actually, CNN can be applied more generally to different kinds of data, as long as the data fits certain assumptions. What assumptions? You'll know later.
1. CNN Features
CNN stands out from traditional NN in 3 areas:
- Sparse interaction (connection)
- Parameter sharing
- Equivariant representation
Actually the third feature is more like a result of the first 2 features. Let's go through them one by one.
[Figure: fully connected NN vs. NN with sparse connections]
Sparse interaction: unlike a fully connected neural network, in a convolution layer each output is connected to only a limited number of inputs, as shown above. For a hidden layer that takes \(m\) neurons as input and produces \(n\) neurons as output, a fully connected layer needs a weight matrix of size \(m*n\) to compute the outputs. When \(m\) is very big, the weight matrix can be huge. With sparse connections, only \(k\) inputs are connected to each output, which reduces the computation from \(O(m*n)\) to \(O(k*n)\) and the memory usage from \(m*n\) to \(k*n\).
Parameter sharing gives more insight when considered together with sparse connection, because sparse connection creates segmentation among the data. For example, \(x_1\) and \(x_5\) are independent in the plot above due to sparse connection. However, with parameter sharing the same weight matrix is used across all positions, which creates a hidden connectivity. Additionally, it further reduces the storage of the weight matrix from \(k*n\) to \(k\). Especially when dealing with images, going from \(m*n\) to \(k\) can be a huge improvement in memory usage.
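To make these counts concrete, here is a minimal sketch in Python. The function name `weight_counts` and the sizes (a flattened 256x256 greyscale image and a 3x3 kernel) are my own assumptions for illustration, not something from the course.

```python
def weight_counts(m, n, k):
    """Number of stored weights for one layer with m inputs and n outputs."""
    return {
        "fully_connected": m * n,  # every output sees every input
        "sparse": k * n,           # every output sees only k inputs
        "sparse_shared": k,        # one k-sized kernel reused at every position
    }

# Hypothetical sizes: 256x256 input, same-sized output, 3x3 kernel.
counts = weight_counts(m=256 * 256, n=256 * 256, k=3 * 3)
for name, value in counts.items():
    print(f"{name:>16}: {value:,}")
```

Only the last number is independent of the input size, which is why the saving grows with bigger images.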
Equivariant representation is a result of parameter sharing. Because the same weight matrix is used at different positions across the input, the output is equivariant to parallel shift (translation): shifting the input shifts the output in the same way. Say \(g\) represents a parallel shift and \(f\) is the convolution function, then \(f(g(x)) = g(f(x))\). This feature can be very useful when we only care about the presence of a feature, not its position. But on the other hand it can be a big flaw of CNN that it is not good at detecting position.
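The identity \(f(g(x)) = g(f(x))\) can be checked numerically. Below is a small sketch, assuming a 1D input and wrap-around (circular) borders so that the identity holds exactly at the edges too. The helper names `circular_conv1d` and `shift` are hypothetical.

```python
import numpy as np

def circular_conv1d(x, w):
    """f: slide kernel w over x with wrap-around, one output per input position."""
    n, k = len(x), len(w)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(k)) for i in range(n)])

def shift(x, steps=2):
    """g: parallel shift by `steps` positions, with wrap-around."""
    return np.roll(x, steps)

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0])
w = np.array([1.0, -1.0, 2.0])

lhs = circular_conv1d(shift(x), w)  # f(g(x))
rhs = shift(circular_conv1d(x, w))  # g(f(x))
print(np.allclose(lhs, rhs))
```

With zero borders instead of wrap-around, the two sides would only differ near the edges of the input.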
2. CNN Components
Given the above 3 features, let's talk about how to implement CNN.
(1). Kernel
The kernel, or so-called filter, is the weight matrix in CNN. It is multiplied element-wise with each patch of the input matrix and outputs the sum. The kernel is usually much smaller than the original input, so that we can take advantage of the decrease in memory.
Below is a 2D input of a convolution layer. It can be a greyscale image, or a multivariate time series.
When the input is 3-dimensional, we call the 3rd dimension the channel (volume). The most common case is an RGB image input, where each channel is a 2D matrix representing one color. See below:
Please keep in mind that the kernel always has the same number of channels as the input! Therefore it leads to dimension reduction in all dimensions (unless you use a 1*1 kernel, which keeps the width and height unchanged). But we can have multiple kernels to capture different features. Like below, we have 2 kernels (filters), each with dimension (3,3,3).
Dimension Cheatsheet of Kernel
- Input dimension: (N_w, N_h, N_channel). When N_channel = 1, it is a 2D input.
- Kernel dimension: (N_k, N_k, N_channel). The kernel isn't always square; it can be (N_k1, N_k2, N_channel).
- Output dimension: (N_w - N_k + 1, N_h - N_k + 1, 1)
- When we have N different kernels, the output dimension will be (N_w - N_k + 1, N_h - N_k + 1, N)
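The cheatsheet can be verified with a naive implementation. This is only a sketch under my own assumptions: the name `conv_valid` is hypothetical, there is no stride or padding, and the kernel is not flipped (cross-correlation, which is what deep learning libraries usually compute).

```python
import numpy as np

def conv_valid(x, kernels):
    """Naive convolution without padding or stride.

    x:       input of shape (n_w, n_h, n_channel)
    kernels: stack of n kernels, shape (n, n_k1, n_k2, n_channel)
    returns: output of shape (n_w - n_k1 + 1, n_h - n_k2 + 1, n)
    """
    n_w, n_h, n_c = x.shape
    n, k1, k2, k_c = kernels.shape
    assert k_c == n_c, "kernel must have the same number of channels as the input"
    out = np.zeros((n_w - k1 + 1, n_h - k2 + 1, n))
    for f in range(n):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i + k1, j:j + k2, :]
                # Element-wise product over width, height and channels, then sum.
                out[i, j, f] = np.sum(patch * kernels[f])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))       # hypothetical 6x6 RGB input
filters = rng.random((2, 3, 3, 3))  # 2 kernels, each (3, 3, 3)
print(conv_valid(image, filters).shape)  # (4, 4, 2)
```

Each kernel collapses the channel dimension into a single output map, and the number of kernels becomes the new channel count.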
(2). Stride
As we mentioned before, one key advantage of CNN is to speed up computation using dimension reduction. Can we be more aggressive about this? Yes, we can use stride! Basically, stride means that while moving the kernel across the input, it skips a certain number of input positions at each step.
We can easily tell how stride works from the comparison below:
No Stride
Stride = 1
Thanks to vdumoulin for such great animations. You can find more on his GitHub.
Stride can further speed up computation, but it will lose some features in the output. We can consider it as downsampling the output.
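The downsampling view can be checked directly: a strided convolution gives the same values as a stride-1 convolution with every other output dropped. A minimal sketch, assuming a single-channel input, a square kernel and no padding (`conv2d` here is my own hypothetical helper, not a library function):

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Naive 2D convolution (no kernel flip, no padding) with a given stride."""
    n_w, n_h = x.shape
    k = kernel.shape[0]  # assumes a square kernel
    out_w = (n_w - k) // stride + 1
    out_h = (n_h - k) // stride + 1
    out = np.zeros((out_w, out_h))
    for i in range(out_w):
        for j in range(out_h):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

dense = conv2d(x, kernel, stride=1)    # shape (5, 5)
strided = conv2d(x, kernel, stride=2)  # shape (3, 3)
print(np.allclose(strided, dense[::2, ::2]))
```

The strided version never computes the dropped outputs, which is where the speed-up comes from.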
(3). Padding
Both kernel and stride function as dimension reduction techniques. For each convolution layer, the output dimension will always be smaller than the input. However, if we want to build a deep convolution network, we don't want the input size to shrink too fast. A small kernel can partly solve this problem, but in order to maintain a certain dimension we need zero padding. Basically it means adding zeros around your input, like below:
Padding = 1
There are a few types of padding that are frequently used:
- Valid padding: no padding at all, output = input - (k-1)
- Same padding: maintain the same size, output = input
- Full padding: each input is visited k times, output = input + (k-1)
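These three padding types happen to be the three modes of NumPy's 1D `np.convolve`, so the output sizes are easy to check. A small sketch with made-up numbers (note that `np.convolve` flips the kernel, which does not affect the size arithmetic):

```python
import numpy as np

x = np.arange(1, 8, dtype=float)  # input size n = 7
w = np.array([1.0, 0.0, -1.0])    # kernel size k = 3
n, k = len(x), len(w)

# Output length for each padding type.
sizes = {mode: len(np.convolve(x, w, mode=mode)) for mode in ("valid", "same", "full")}
print(sizes)

# "Same" padding done by hand: add (k - 1) // 2 zeros on each side,
# then run a valid convolution on the padded input.
padded = np.pad(x, (k - 1) // 2)  # default mode pads with zeros
same_by_hand = np.convolve(padded, w, mode="valid")
print(np.allclose(same_by_hand, np.convolve(x, w, mode="same")))
```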
To summarize, we use \(s\) to denote stride and \(p\) to denote padding. \(n\) is the input size and \(k\) is the kernel size (kernel and input are both square for simplicity). Then the output dimension is the following:
\[\lfloor (n+2p-k)/s\rfloor +1\]
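The formula translates into one line of code. A quick sketch, where the function name `conv_output_size` is my own and the same/full padding amounts assume an odd kernel size and stride 1:

```python
def conv_output_size(n, k, s=1, p=0):
    """Output size along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n, k = 7, 3
valid = conv_output_size(n, k)                  # no padding
same = conv_output_size(n, k, p=(k - 1) // 2)   # keeps the input size
full = conv_output_size(n, k, p=k - 1)          # every input visited k times
strided = conv_output_size(n, k, s=2)           # stride 2, no padding
print(valid, same, full, strided)
```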
(4). Pooling
I remember that in a recent paper on CNN, the author says something like: I can't explain why I add a pooling layer, but a good CNN structure always comes with a pooling layer.
Pooling functions as a dimension reduction technique. But unlike the kernel, which reduces all dimensions, pooling keeps the channel dimension untouched. Therefore it can further accelerate computation.
Basically, pooling outputs a certain statistic for a certain amount of input. This introduces a feature stronger than equivariant representation: invariant representation.
The most commonly used pooling types are max and average pooling. There are also L2 pooling, weighted average pooling, etc.
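Here is a minimal sketch of max and average pooling, assuming non-overlapping square windows and an input of shape (width, height, channel). The helper `pool2d` is hypothetical; it only shows that the spatial size shrinks while the channel dimension stays untouched.

```python
import numpy as np

def pool2d(x, size=2, reduce=np.max):
    """Non-overlapping pooling; `reduce` is the statistic taken per window."""
    n_w, n_h, n_c = x.shape
    out_w, out_h = n_w // size, n_h // size
    # Crop any remainder, then split each spatial axis into (blocks, size).
    blocks = x[:out_w * size, :out_h * size, :].reshape(out_w, size, out_h, size, n_c)
    return reduce(blocks, axis=(1, 3))

x = np.arange(32, dtype=float).reshape(4, 4, 2)
max_pooled = pool2d(x, size=2, reduce=np.max)
avg_pooled = pool2d(x, size=2, reduce=np.mean)
print(max_pooled.shape, avg_pooled.shape)  # (2, 2, 2) (2, 2, 2)
```

Swapping `reduce` for another statistic gives the other pooling variants.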
3. CNN Structure
(1). Intuition of CNN
In the Deep Learning book, the authors give a very interesting insight. They consider convolution and pooling as an infinitely strong prior distribution. This prior indicates that all hidden units share the same weights, are derived from a limited amount of the input, and have a translation-invariant feature.
Under Bayesian statistics, the prior distribution is a subjective preference for the model based on experience, and the stronger the prior distribution is, the higher impact it will have on the optimal model. So before we use CNN, we have to make sure that our data fits the above assumptions.
(2). Classic structure
A classic convolutional neural network has a convolutional layer, a non-linear activation layer, and a pooling layer. For a deep NN, we can stack a few convolution layers together, like below:
The above plot is taken from Adit Deshpande's A Beginner's Guide to Understanding Convolutional Neural Networks. He is one of my favorite ML bloggers.
The interesting part of a deep CNN is that a deep hidden layer can receive more information from the input than a shallow layer, meaning that although the direct connections are sparse, the deeper hidden neurons are still able to receive nearly all the features from the input.
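One way to quantify this is the receptive field: how many input positions a single unit can see after a few stacked layers. Below is a rough sketch along one dimension, assuming every layer uses the same kernel size and stride and ignoring pooling. The function name `receptive_field` is my own.

```python
def receptive_field(num_layers, k, s=1):
    """Input positions seen by one unit after stacking identical conv layers."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (k - 1) * jump  # each layer widens the view by (k - 1) steps
        jump *= s                # stride makes every later step cover more input
    return field

for depth in (1, 2, 3, 5):
    print(depth, receptive_field(depth, k=3))
```

Even with a 3-wide kernel, the view grows with every layer, and it grows much faster once stride is larger than 1.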
(3). To be continued
As I learn more and more about NN, I gradually realize that NN is more flexible than I thought. It's like LEGO: convolution and pooling are just different basic tools with different assumptions. You need to analyze your data, select the tools that fit your assumptions, and try combining them to improve performance iteratively. Later I'll open a new post to collect all the NN structures that I have ever read about.
Reference
1 Vincent Dumoulin, Francesco Visin - A guide to convolution arithmetic for deep learning
2 Adit Deshpande - A Beginner's Guide to Understanding Convolutional Neural Networks
3 Ian Goodfellow, Yoshua Bengio, Aaron Courville - Deep Learning
Deeplearning - Overview of Convolution Neural Network