Notes | Andrew Ng's Coursera Deep Learning Study Notes

Source: Internet
Author: User
Tags: assert, ord, scalar, coursera, deep learning



Author: Lisa Song

Senior data scientist in Cloud & AI at Microsoft headquarters, now living in Seattle. With years of experience in machine learning and deep learning, she is familiar with the requirements analysis, architecture design, algorithm development and deployment of machine learning and AI products across a variety of business scenarios.



Andrew Ng's Coursera Deep Learning Study Notes 1 (Part 1)


"Learning experience"


The Deep Learning Specialization from Coursera and deeplearning.ai is coming out really slowly... So far only Course 1, Neural Networks and Deep Learning, has been released, with 4 more courses to come. Although it seems quite basic to me and to most deep-learning practitioners, I still sometimes get the exercises wrong. My talent is ordinary, so I still need to revisit this basic course from time to time. Studying humbly is nothing to be ashamed of.


"Learning Notes"


Andrew's first course is well rounded, and the amount of homework increases every week. You can see that Coursera has gradually raised the proportion of exercises; personally I like this change, and I suggest everyone do the exercises after each class, especially the Python parts.


At the end of each of the first three weeks there is an interview with a "hero of deep learning". I watched the first week's interview with Geoffrey Hinton, but skipped the later ones with Pieter Abbeel and Ian Goodfellow because they run too long. When the teaching material itself is quite basic, it is easy to get distracted, so I did not want to spend my limited attention on them; I will come back to them later if I am interested.


I organize my summary into two major modules: theoretical knowledge and Python practice. These notes are only my personal summary of the course, focused on my own weak points; interested readers should still go to Coursera and take the course so that their learning is more comprehensive. Readers on a tight budget need not worry: the first 7 days are free.


Deep Neural Networks


What is a neural network?


Neural networks are not a new term, but with breakthroughs in all three of data, computational power and theory, they have in recent years welcomed spring after a long winter. Neural networks are mainly composed of neurons; a neuron is usually a linear combination of multiple inputs followed by an activation function, and the activation function is often nonlinear. Like the human brain, many neurons combined through all kinds of connections gain powerful capabilities. Andrew gives two simple examples at the outset.


Single-neuron network


If our input has only one variable, applying a ReLU function on top of it forms a single-neuron network. ReLU stands for Rectified Linear Unit; don't be frightened by the name, much of deep learning's terminology is a paper tiger, and ReLU is really just a max function, ReLU(z) = max(0, z). Its shape looks like this:


Single-neuron network (ReLU function)


sigmoid function


The image above is the traditional sigmoid function, which is used not only in logistic regression but also in deep learning. If you compare the ReLU and sigmoid functions, you will find that ReLU sets part of the data to 0, which makes it faster than sigmoid when computing on large-scale data (imagine that if each ReLU layer discarded 50% of the useless data, then after 4 layers only about 6% of the original data would remain; of course the real situation is not that simple). Over the rest of its range, ReLU's derivative is 1, which makes the gradient very convenient in backpropagation. The sigmoid function's derivative, by contrast, is nonzero everywhere but very small over the vast majority of its range, so it cannot filter out unwanted data, it greatly slows down learning by gradient descent, and it may also cause vanishing gradients during backpropagation.
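
To make the comparison concrete, here is a minimal NumPy sketch of the two functions (my own illustration, not code from the course):

import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1); its derivative sigmoid(z) * (1 - sigmoid(z)) is tiny for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # ReLU is just max(0, z); its derivative is 1 for z > 0 and 0 for z < 0
    return np.maximum(0, z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(sigmoid(z))   # values squeezed into (0, 1), saturating at both ends
print(relu(z))      # [0. 0. 0. 1. 3.]: negative inputs are zeroed out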


Multi-neuron network


Multi-Neuron Network


Deep learning can be applied not only to traditional structured data, but also to unstructured data such as sound, images and text. I mentioned the applications of deep learning in a previous lecture, wanting to say that deep learning handles unstructured data more easily than traditional methods, but I unexpectedly forgot the word "structured", got stuck for a while, and finally muddled through with "two-dimensional vs. multidimensional" when I actually meant structured vs. unstructured. Fortunately it was only a slip of wording, not of the knowledge point, and presumably nobody noticed. It goes to show that relying only on what you have read is not enough, especially in a field as new and full of mixed terminology as deep learning.


Structured and unstructured data


Logistic regression as the simplest neural network (Logistic Regression)


The familiar logistic regression can be regarded as one of the simplest neural networks, where z is a linear combination of the input features and the sigmoid function is a nonlinear transformation of z. Accordingly, Andrew builds a most basic neural network on top of logistic regression, whose input is a vectorized image. A picture's input is originally three-dimensional (a two-dimensional grid of pixels plus the three RGB colors); through image2vector, each picture is flattened into a single column, i.e. a tensor of dimensions (a, b, c, d) is flattened to (b*c*d, a). Tip: X_flatten = X.reshape(X.shape[0], -1).T, where .T is the transpose of X and a is the number of samples.
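
A minimal sketch of this flattening step (the batch size and image dimensions are made up for illustration):

import numpy as np

X = np.random.randn(10, 64, 64, 3)        # a hypothetical batch of 10 images, 64x64 pixels, 3 RGB channels
X_flatten = X.reshape(X.shape[0], -1).T   # shape (b*c*d, a) = (12288, 10)
print(X_flatten.shape)                    # (12288, 10): one column per sample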


Logistic regression as a neural network: identifying whether a picture is a cat


The loss function of logistic regression (loss function: the loss on a single sample) and the cost function (cost function: the loss over all samples of the training data)
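
For reference, written out from the standard logistic-regression definitions (and consistent with the code further below), the two formulas are:
L(a, y) = −( y·log(a) + (1−y)·log(1−a) )        (loss on a single sample)
J(w, b) = (1/m) · Σ_i L(a(i), y(i))             (cost: the average loss over the m training samples)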


In summary, the modeling of neural networks consists of the following major steps:


1. Define the model structure, such as how many layers, which activation functions, how many input features, and so on;

2. Initialize the model parameters; in the case of logistic regression it is enough to initialize w and b to 0, without needing to add any jitter;

3. Run num_iterations iterations of the following loop:
3.1 Forward propagation: compute the current cost
3.2 Backward propagation: compute the current gradients
3.3 Gradient descent: update the parameters as θ_updated = θ − α·dθ, where α is the learning rate and dθ is really dJ/dθ (a small sketch of this step follows after the list).


Until the iteration ends, or the gradient is approximately zero and the parameters no longer change.
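
A minimal sketch of the update in step 3.3 (the function name is my own; dw and db are assumed to come from the backward pass):

def update_parameters(w, b, dw, db, learning_rate=0.01):
    # one gradient-descent step: theta := theta - alpha * d(theta)
    w = w - learning_rate * dw
    b = b - learning_rate * db
    return w, b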


The derivative of the sigmoid function should be kept in mind: sigmoid_derivative(x) = σ′(x) = σ(x)·(1 − σ(x)), along with the two gradient formulas below:


The gradient-descent formulas for dw and db in logistic regression
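
Written out to match the code below, the two formulas in the figure are:
dw = (1/m) · X · (A − Y)ᵀ
db = (1/m) · Σ_i (a(i) − y(i))
where A is the row vector of predictions and Y the row vector of labels.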


Its code, inside the forward-propagation function, is implemented as follows:
m = X.shape[1]                                                # number of samples
A = sigmoid(np.dot(w.T, X) + b)                               # activations, shape (1, m)
cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m   # cross-entropy cost
dw = np.dot(X, (A - Y).T) / m                                 # gradient of the cost with respect to w
db = np.sum(A - Y) / m                                        # gradient of the cost with respect to b
To ensure that there are no errors in the matrix calculations and broadcasting, the following assertions are made:
assert(dw.shape == w.shape)
assert(db.dtype == float)
cost = np.squeeze(cost)      # reduce the cost's dimensions so that it becomes a plain number
assert(cost.shape == ())


Python


Python Broadcasting


I found this part particularly worthwhile, because most tutorials cover it in just a few words, while Andrew goes into more depth. In matrix operations, a bug caused by overlooked broadcasting is very hard to track down, so I will focus on it here:


Broadcasting refers to how NumPy handles arithmetic between matrices of different shapes: the smaller matrix is propagated (essentially Ctrl-C + Ctrl-V) so that it "grows" into a matrix that can operate harmoniously with the larger one. A concrete description is in the NumPy documentation: Broadcasting (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).


This is handy, for example, when computing np.dot(w.T, X) + b, because b is often just a scalar while the first term is a matrix; broadcasting makes this step convenient.


>>> a = np.array([1.0, 2.0, 3.0])
>>> b = 2.0
>>> a * b   # in the multiplication, NumPy's broadcasting mechanism automatically expands the scalar b to np.array([2.0, 2.0, 2.0]) so that it can be multiplied with each element of a
array([2., 4., 6.])


Another example: row-wise normalization of a matrix means dividing the matrix by the norm of each row:


Normalizing rows


The figure above is implemented by:
>>> x_norm = np.linalg.norm(x, ord=2, axis=1, keepdims=True)
>>> x = x / x_norm
Here ord=2 means the L2 norm; axis=1 computes it horizontally, i.e. per row, while axis=0 would compute it vertically, i.e. per column. keepdims=True prevents NumPy from outputting a rank-1 array of shape (n,); with keepdims=True the output shape is (n, 1). The difference between the two is shown further below.
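
A small sketch of the difference keepdims makes (my own example):

>>> import numpy as np
>>> x = np.array([[0., 3., 4.], [1., 6., 4.]])
>>> np.linalg.norm(x, axis=1).shape                  # without keepdims: a rank-1 array
(2,)
>>> np.linalg.norm(x, axis=1, keepdims=True).shape   # with keepdims: a column vector that broadcasts cleanly against x
(2, 1)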


Another place where broadcasting shows its efficiency is the softmax function, commonly used to normalize the output of a multi-class classification problem so that the class probabilities add up to 1. From the softmax formula in the figure below you can see that the denominator and the original matrix have different shapes: the denominator has one dimension fewer than the original matrix. When the original matrix is 1*n, the denominator is the sum of its n numbers and is therefore a scalar; when the original matrix is m*n, the denominator is the per-row sum of n numbers and is therefore an m*1 matrix (array). Broadcasting lets the numerator be divided directly by the denominator to obtain the normalized value of every entry of the original matrix.


Softmax function


The code for softmax is:
def softmax(x):
    x_exp = np.exp(x)                             # use NumPy's exp rather than math's: math usually accepts only real-number (scalar) input, while NumPy accepts tensors, so in deep learning we rarely use math and mostly use NumPy (the same goes for log earlier)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)  # per-row sums, shape (m, 1)
    s = x_exp / x_sum                             # broadcasting divides each row of x_exp by that row's sum
    return s


With the broadcasting mechanism, two matrices can be combined element-wise when their dimensions match, or when one of the matrices has size 1 along a dimension. At the same time, to reduce the likelihood of bugs, it is recommended to check matrix shapes with assert + shape, or to use the reshape method.
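
A minimal sketch of that defensive style (the shapes are illustrative):

import numpy as np

A = np.random.randn(4, 3)
b = np.random.randn(4, 1)     # a column vector, broadcast across the 3 columns of A
C = A - b                     # element-wise subtraction via broadcasting
assert C.shape == (4, 3)      # fail fast if broadcasting produced an unexpected shape
b = b.reshape(4, 1)           # reshape is cheap; calling it even when the shape is already right documents the intent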


Dot product, element-wise product, and outer product in NumPy


Dot product (dot product)


np.dot(A, B) is ordinary matrix multiplication from linear algebra: it follows the matrix-multiplication rule and requires the dimensions of the two matrices to conform to the usual rules of linear-algebra multiplication.


Element-wise product


np.multiply() and * are both the element-wise product: the elements of the arrays are multiplied one by one, individually. Note that broadcasting applies to the element-wise product (*) rather than to the dot product. For example:


A = np.random.randn(4, 3)   # A.shape = (4, 3)
b = np.random.randn(3, 2)   # b.shape = (3, 2)
c = A * b                   # raises an error: for broadcasting against A, b would need a shape such as (4, 1), (1, 3) or (3,)
c = np.dot(A, b)            # works, and gives a 4*2 matrix
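
For comparison, a short sketch (my own) of shapes that do broadcast against a (4, 3) matrix in an element-wise product:

import numpy as np

A = np.random.randn(4, 3)
print((A * np.random.randn(4, 1)).shape)   # (4, 3): the column vector is repeated across the 3 columns
print((A * np.random.randn(1, 3)).shape)   # (4, 3): the row vector is repeated across the 4 rows
print((A * np.random.randn(3)).shape)      # (4, 3): a (3,) array behaves like a (1, 3) row here
print((A * 2.0).shape)                     # (4, 3): a scalar is broadcast to every element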


Outer product (outer product)


NumPy's outer product is out[i, j] = a[i] * b[j]; that is, the outer product of two one-dimensional arrays a = [a0, a1, ..., am] and b = [b0, b1, ..., bn] is:
[[a0*b0  a0*b1  ...  a0*bn]
 [a1*b0  a1*b1  ...  a1*bn]
 [  ...    ...  ...    ...]
 [am*b0  am*b1  ...  am*bn]]


>>> import numpy as np
>>> x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
>>> x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]
>>> np.dot(x1, x2)
278
>>> np.outer(x1, x2)
[[81 18 18 81 0 81 18 45 0 0 81 18 45 0 0]
[18 4 4 18 0 18 4 10 0 0 18 4 10 0 0]
[45 10 10 45 0 45 10 25 0 0 45 10 25 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[63 14 14 63 0 63 14 35 0 0 63 14 35 0 0]
[45 10 10 45 0 45 10 25 0 0 45 10 25 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[81 18 18 81 0 81 18 45 0 0 81 18 45 0 0]
[18 4 4 18 0 18 4 10 0 0 18 4 10 0 0]
[45 10 10 45 0 45 10 25 0 0 45 10 25 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
>>> np.multiply(x1, x2)   # or x1 * x2, provided x1 and x2 are NumPy arrays (for plain Python lists, * does not give the element-wise product)
[81 4 10 0 0 63 10 0 0 0 81 4 25 0 0]
Vectorization


The message of this part is: use matrix operations instead of for loops wherever possible. For example, in backpropagation, when the loss of each sample is computed and accumulated into the cost function, and when each parameter is updated, matrix operations are used. A for loop is really only unavoidable for the iterations themselves.


Andrew demonstrates a roughly 300x difference in running time between a for loop and vectorization (on millions of numbers)


* Here numpy.random.randn(d0, d1, ..., dn) returns an n-dimensional array of samples from the standard normal distribution.
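
A sketch of the kind of timing comparison Andrew runs (my own version; the exact speedup depends on the machine):

import time
import numpy as np

a = np.random.randn(1000000)
b = np.random.randn(1000000)

tic = time.time()
c = np.dot(a, b)              # vectorized dot product
print("vectorized:", time.time() - tic, "s")

tic = time.time()
c = 0.0
for i in range(1000000):      # explicit for loop over the same data
    c += a[i] * b[i]
print("for loop:", time.time() - tic, "s")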


numpy Application Tips


Avoid rank-1 arrays with shapes like (5,). At first glance such an array looks like a 5*1 matrix, but in computations it gives completely different results from a 5*1 matrix. The way to avoid this is to set the dimensions explicitly (or use reshape) at definition time, and to use plenty of assert + shape checks during computation.


Result when a is a rank-1 array


If the input is np.random.randn(5, 1) instead, the results are completely different, even though the printed a looks much the same at first glance.


Result when a is a 5*1 matrix
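
A short sketch of the difference:

>>> import numpy as np
>>> a = np.random.randn(5)      # rank-1 array, shape (5,)
>>> a.T.shape                   # transposing a rank-1 array changes nothing
(5,)
>>> np.dot(a, a.T).shape        # inner product: a single number
()
>>> a = np.random.randn(5, 1)   # a proper 5*1 column vector
>>> a.T.shape                   # now the transpose really is a 1*5 row vector
(1, 5)
>>> np.dot(a, a.T).shape        # outer product: a 5*5 matrix
(5, 5)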



Andrew Ng's Coursera Deep Learning Study Notes 1 (Part 2)



"Learning Notes"


Previously, we introduced neural networks using logistic regression as the example, but that model has no hidden layer, so strictly speaking it is not a real neural network. In this article, let's deepen the network along with Andrew and add some ReLU neurons before the sigmoid. At the end, we will use deep learning to build a cat-recognition model, taking as our example the cute cats that AI scientists are wildly obsessed with. But before that, let's look at what activation functions there are besides the sigmoid and ReLU we covered last time, and what their pros and cons are.


Activation functions


Sigmoid: Except in the output layer (when the output is a binary classification over {0,1}), sigmoid is rarely used in the hidden layers, because the mean of its output is around 0.5; in the same situation a tanh function, whose output mean is around 0, is usually preferable.

Tanh: Effectively a shifted version of sigmoid whose output range is (-1, 1) instead of (0, 1), so the input to the next layer is centered; it is generally more popular.

ReLU (Rectified Linear Unit): Already covered in Study Notes 1 (Part 1); it is used more often in deep learning than sigmoid and tanh. This is because when the input z to the activation is very large or very small, the gradients of sigmoid and tanh become very small, which greatly slows down learning by gradient descent. Compared with sigmoid and tanh, ReLU therefore trains much faster.

Leaky ReLU: Leaky ReLU behaves a little better than ReLU, but in practice we most often use ReLU (a short sketch of tanh and Leaky ReLU follows below).
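
A minimal sketch of tanh and Leaky ReLU, to go with the sigmoid and ReLU sketched in Part 1 (the 0.01 slope is a common choice, not something fixed by the course):

import numpy as np

def tanh(z):
    # a shifted and rescaled sigmoid: output in (-1, 1), centred at 0
    return np.tanh(z)

def leaky_relu(z, slope=0.01):
    # like ReLU, but negative inputs keep a small slope instead of being zeroed
    return np.where(z > 0, z, slope * z)
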
From shallow neural networks to deep neural networks


In the first example, logistic regression, there was no hidden layer. In the following lessons, Andrew introduces neural networks with 1 hidden layer, 2 hidden layers, and L hidden layers. But as long as you understand forward propagation and backward propagation, plus a few small steps described later, you will find that they are all essentially the same. Ready? Take a deep breath with me and dive into the ocean of deep learning ~


By convention, the input layer is not counted in the number of layers of a neural network: if a neural network has L layers, it has L-1 hidden layers and one output layer. We can observe the input and output layers directly, but it is not easy to see how the data changes in between, so the part between the input and output layers is called the hidden layers.


Training a deep neural network is broadly divided into the following steps (also described in detail in Study Notes 1 (Part 1)):


1. Define the structure of the neural network (the hyperparameters)

2. Initializing Parameters

3. Loop iterations

3.1 Forward propagation is split into a linear forward step and an activation forward step. In the linear forward step, z[l] = W[l]·a[l−1] + b[l], where a[0] = x; in the activation forward step, a[l] = g(z[l]). The values of W, b and z are cached along the way. Finally, the current loss is computed.
3.2 Backward propagation is likewise split into an activation backward step and a linear backward step.
3.3 Update parameters.


The figure below shows an L-layer deep neural network whose L-1 hidden layers use the ReLU activation function, together with its training steps and flow.
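
To tie the steps together, here is a minimal sketch of the forward pass through such an L-layer network (the function and dictionary names are my own, not necessarily the assignment's API):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def forward(X, parameters):
    # parameters holds W1..WL and b1..bL; layers 1..L-1 use ReLU, layer L uses sigmoid
    L = len(parameters) // 2
    A = X                                   # a[0] = x
    for l in range(1, L):
        Z = np.dot(parameters["W" + str(l)], A) + parameters["b" + str(l)]   # linear forward
        A = relu(Z)                                                          # activation forward
    ZL = np.dot(parameters["W" + str(L)], A) + parameters["b" + str(L)]
    AL = sigmoid(ZL)                        # output layer: e.g. the probability that the picture is a cat
    return AL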
