Understanding deep learning requires familiarity with a few simple mathematical concepts: tensors, tensor operations, differentiation, gradient descent, and so on.

"Hello World": MNIST handwritten digit recognition

```python
# coding: utf8
import keras
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Define the network architecture
network = models.Sequential()
network.add(layers.Dense(512, activation="relu", input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation="softmax"))

# Configure training: optimizer, loss function, and evaluation metric
network.compile(optimizer="rmsprop",
                loss="categorical_crossentropy",
                metrics=["accuracy"])

# Preprocess the images: flatten them and scale pixel values to [0, 1]
train_images = train_images.reshape(60000, 28 * 28)
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape(test_images.shape[0], 28 * 28)
test_images = test_images.astype("float32") / 255

# Preprocess the labels: one-hot encoding
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Train the model
network.fit(train_images, train_labels, epochs=5, batch_size=128)

# Evaluate the model on the test set
test_loss, test_acc = network.evaluate(test_images, test_labels)
print("Test accuracy:", test_acc)  # test accuracy: 0.9727
```

The program above shows how to build a network and how to train it to recognize handwritten digits.

Data representation of neural networks

Almost all current machine learning frameworks use tensors as their basic data structure. A tensor is essentially a container for data, almost always numeric data, so it is a container for numbers. Matrices are 2D tensors, and tensors generalize matrices to an arbitrary number of dimensions (in the context of tensors, a dimension is often called an axis).

Scalars (0D tensors)

A tensor that contains only one number is called a scalar (or 0D tensor). In NumPy, a float32 or float64 number is a scalar tensor. The number of axes of a tensor is available through its ndim attribute; a scalar tensor has 0 axes. The number of axes of a tensor is also called its rank.

```python
>>> import numpy as np
>>> x = np.array(12)
>>> x
array(12)
>>> x.ndim
0
```

Vectors (1D tensors)

A one-dimensional array of numbers is called a vector, or 1D tensor. A 1D tensor has exactly one axis.

```python
>>> x = np.array([13, 31, 7, 14])
>>> x
array([13, 31, 7, 14])
>>> x.ndim
1
```

The vector above has 4 entries and is therefore called a 4-dimensional vector. Do not confuse a 4-dimensional vector with a 4D tensor: a 4-dimensional vector has a single axis with 4 elements along it, whereas a 4D tensor has 4 axes.

Matrices (2D tensors)

An array of vectors is a matrix, or 2D tensor. A matrix has two axes.

```python
>>> x = np.array([[5, 78, 2, 34, 0],
...               [6, 79, 3, 35, 1],
...               [7, 80, 4, 36, 2]])
>>> x.ndim
2
```

3D tensors and higher-dimensional tensors

An array of matrices is called a 3D tensor and can be pictured as a cube of numbers.

```python
>>> x = np.array([[[5, 78, 2, 34, 0],
...                [6, 79, 3, 35, 1],
...                [7, 80, 4, 36, 2]],
...               [[5, 78, 2, 34, 0],
...                [6, 79, 3, 35, 1],
...                [7, 80, 4, 36, 2]],
...               [[5, 78, 2, 34, 0],
...                [6, 79, 3, 35, 1],
...                [7, 80, 4, 36, 2]]])
>>> x.ndim
3
```

An array of 3D tensors forms a 4D tensor, and so on. Deep learning generally manipulates tensors from 0D to 4D.

Core Properties

A tensor is defined by three key attributes:

- Number of axes (rank): a 3D tensor has 3 axes. The number of axes is available through the tensor's ndim attribute.
- Shape: a tuple of integers that gives the size of the tensor along each axis. In the examples above, the scalar has shape (), the vector (4,), the matrix (3, 5), and the 3D tensor (3, 3, 5).
- Data type (the dtype attribute): the type of the numbers stored in the tensor, such as float32, uint8, or float64.
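As a quick check of these three attributes, the sketch below builds a small array and inspects it. The array contents and the variable name `x` are made up for illustration.

```python
import numpy as np

# A hypothetical 3D tensor: 3 matrices, each with 3 rows and 5 columns.
x = np.zeros((3, 3, 5), dtype="float32")

print(x.ndim)   # number of axes (rank)
print(x.shape)  # size along each axis
print(x.dtype)  # type of the stored numbers
```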

Data batches

The first axis (axis 0) of a data tensor in deep learning is usually the samples axis (sometimes called the samples dimension). In the MNIST dataset, each sample is an image of a digit.

In addition, deep learning models do not process an entire dataset at once; they typically split the data into small batches. For example, one batch of MNIST digits with a batch size of 128:

`batch = train_images[:128]`
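Later batches can be selected the same way by shifting the slice. The sketch below assumes a stand-in array instead of the real MNIST images so that it runs without downloading anything; the helper `get_batch` is a hypothetical name introduced here.

```python
import numpy as np

# Stand-in for the MNIST training images (assumed layout: samples, height, width).
train_images = np.zeros((1000, 28, 28), dtype="uint8")

def get_batch(data, n, batch_size=128):
    """Return the n-th batch (0-based) along the samples axis."""
    return data[batch_size * n : batch_size * (n + 1)]

first = get_batch(train_images, 0)
second = get_batch(train_images, 1)
# 1000 = 7 * 128 + 104, so the last batch holds the remaining 104 samples.
last = get_batch(train_images, 7)

print(first.shape, second.shape, last.shape)
```

Only axis 0 is sliced; the remaining axes are kept whole, so every batch keeps the per-sample shape (28, 28).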

Real-world examples of data tensors

- Vector data: 2D tensors of shape (samples, features)
- Time series or sequence data: 3D tensors of shape (samples, timesteps, features)
- Images: 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
- Video: 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
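To make these shapes concrete, the sketch below allocates one placeholder tensor of each kind. All sizes are hypothetical and chosen only to keep the arrays small; the channels-last layout is used for images and video.

```python
import numpy as np

# All sizes below are hypothetical examples.
vector_data = np.zeros((100, 3))            # 100 samples, 3 features each
timeseries = np.zeros((50, 390, 3))         # 50 samples, 390 timesteps, 3 features
images = np.zeros((16, 28, 28, 1))          # 16 grayscale 28x28 images
video = np.zeros((2, 10, 32, 32, 3))        # 2 clips, 10 frames of 32x32 RGB

tensors = {
    "vector data": vector_data,
    "time series": timeseries,
    "images": images,
    "video": video,
}
for name, t in tensors.items():
    print(name, t.ndim, t.shape)
```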

Tensor operations

Just as any computer program can ultimately be reduced to a small set of binary operations on binary inputs (AND, OR, NOR, and so on), all transformations learned by a deep learning network can be reduced to a handful of tensor operations applied to tensors of numeric data, such as addition and multiplication.

Element-wise operations

The relu operation and addition are element-wise operations: they are applied independently to each entry of the tensors involved.

For example, a naive for-loop implementation of element-wise addition:

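```python
def naive_add(x, y):
    # x and y are 2D NumPy tensors with the same shape
    assert len(x.shape) == 2
    assert x.shape == y.shape

    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x
```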
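The same loop pattern gives an element-wise relu. The sketch below is an illustration, not code from the original post; `naive_relu` is a name introduced here, and it is compared against NumPy's built-in vectorized operations, which do the same work without Python loops.

```python
import numpy as np

def naive_relu(x):
    """Element-wise relu for a 2D tensor: max(x[i, j], 0) for every entry."""
    assert len(x.shape) == 2
    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
y = np.array([[10.0, 20.0], [30.0, 40.0]])

print(naive_relu(x))        # loop version
print(np.maximum(x, 0.0))   # vectorized NumPy equivalent
print(x + y)                # vectorized element-wise addition
```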

Broadcasting

The naive_add implementation above only supports two 2D tensors with identical shapes. What happens when the shapes of the two tensors being added differ? When possible, the smaller tensor is broadcast to match the shape of the larger tensor. Broadcasting consists of two steps:

- Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor.
- The smaller tensor is repeated along these new axes to match the full shape of the larger tensor.

For example, suppose tensor x has shape (32, 10) and tensor y has shape (10,). To add them, first add a new axis to y so that its shape becomes (1, 10), then repeat y 32 times along this new axis so that its shape becomes (32, 10). Now x and y have the same shape and can be added.

In practice, no new 2D tensor is actually created, because that would hurt computational efficiency; the repetition is purely conceptual. A naive implementation looks like this:

```python
def naive_add_matrix_and_vector(x, y):
    # x is a 2D NumPy tensor, y is a NumPy vector
    assert len(x.shape) == 2
    assert len(y.shape) == 1
    assert x.shape[1] == y.shape[0]

    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x
```
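The two broadcasting steps can also be spelled out explicitly in NumPy and compared with the automatic behaviour of `+`. This is a sketch with made-up data; `np.expand_dims` and `np.repeat` are used here only to mirror the two steps, and the explicit copy is exactly what real broadcasting avoids.

```python
import numpy as np

x = np.ones((32, 10))
y = np.arange(10, dtype="float64")                # shape (10,)

# Step 1: add a broadcast axis so y has as many axes as x.
y_expanded = np.expand_dims(y, axis=0)            # shape (1, 10)

# Step 2: repeat y along the new axis to match the shape of x.
y_repeated = np.repeat(y_expanded, 32, axis=0)    # shape (32, 10)

explicit = x + y_repeated
implicit = x + y   # NumPy broadcasts y without materializing the repeated copy

print(y_expanded.shape, y_repeated.shape)
print(np.array_equal(explicit, implicit))
```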

Tensor dot product

The dot product is the most common and most useful tensor operation. Unlike element-wise operations, it combines the entries of the input tensors.

```python
def naive_vector_dot(x, y):
    # x and y are NumPy vectors of the same length
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    assert x.shape[0] == y.shape[0]

    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z
```
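The same idea extends to the product of a matrix and a vector, where each output entry is the dot product of one row with the vector. The sketch below is an illustration with made-up values; `naive_matrix_vector_dot` is a name introduced here, and its result is compared with NumPy's `np.dot`.

```python
import numpy as np

def naive_matrix_vector_dot(x, y):
    """Dot product of a 2D tensor x (rows, cols) with a 1D tensor y (cols,)."""
    assert len(x.shape) == 2
    assert len(y.shape) == 1
    assert x.shape[1] == y.shape[0]  # inner sizes must match

    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            z[i] += x[i, j] * y[j]
    return z

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([1.0, 0.0, -1.0])

print(naive_matrix_vector_dot(x, y))
print(np.dot(x, y))
```

The result has shape (2,): one entry per row of x.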

Tensor reshaping

Reshaping a tensor means rearranging its rows and columns to match a target shape. The reshaped tensor has the same total number of coefficients as the initial tensor.

```python
>>> x = np.array([[0., 1.],
...               [2., 3.],
...               [4., 5.]])
>>> print(x.shape)
(3, 2)
>>> x = x.reshape((6, 1))
>>> x
array([[ 0.],
       [ 1.],
       [ 2.],
       [ 3.],
       [ 4.],
       [ 5.]])
```
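As a small follow-up sketch (the extra target shapes are chosen here for illustration), the total number of coefficients is preserved by any valid reshape, and NumPy rejects a target shape whose size does not match.

```python
import numpy as np

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])        # shape (3, 2), 6 coefficients

a = x.reshape((6, 1))
b = x.reshape((2, 3))

print(x.size, a.size, b.size)   # same number of coefficients every time
print(b)

# A shape with a different total size cannot be produced by reshape.
try:
    x.reshape((4, 2))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)
```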

Gradient-based optimization

Each neural network layer transforms its input as follows:

\(\text{output} = \text{relu}(\text{dot}(W, \text{input}) + b)\)

The tensors \(W\) and \(b\) are the parameters of the layer, called its weights or trainable parameters. These weights contain the information the network has learned from the training data.

Initially, the weights are filled with small random values (a step called random initialization). They are then adjusted gradually, based on a feedback signal. This gradual adjustment is called training.
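The layer formula can be sketched directly in NumPy. The layer sizes, the random values, and the helper `relu` below are assumptions made for illustration; the order of the arguments to `dot` follows the formula as written in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, units = 4, 3                               # hypothetical layer sizes
W = rng.normal(scale=0.1, size=(units, input_dim))    # weight matrix
b = np.zeros(units)                                   # bias vector
x = rng.normal(size=input_dim)                        # one input sample

def relu(t):
    """Element-wise max(t, 0)."""
    return np.maximum(t, 0.0)

# output = relu(dot(W, input) + b)
output = relu(np.dot(W, x) + b)
print(output.shape)
```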

Training repeats the following steps in a loop:

- Draw a batch of training samples x and the corresponding targets y;
- Run the network on x (the forward pass) to obtain the predictions y_pred;
- Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y;
- Update the weights in a direction that reduces the loss.

Stochastic gradient descent

In theory, the minimum of a differentiable function can be found analytically: the derivative is 0 at a minimum, so it suffices to find all points where the derivative is 0 and compare the function's values at those points to find the smallest.

For a neural network, this means finding the set of weight values that minimizes the loss function.
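As a small worked example (the function is chosen here purely for illustration), take \(f(x) = x^2 - 4x + 5\):

\[
f'(x) = 2x - 4 = 0 \;\Rightarrow\; x = 2, \qquad f(2) = 4 - 8 + 5 = 1 .
\]

Since \(f''(x) = 2 > 0\), the single point with zero derivative is a minimum, and the minimum value of \(f\) is 1.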

Mini-batch SGD can be described as the following four steps:

- Draw a batch of training samples x and the corresponding targets y;
- Run the network on x (the forward pass) to obtain the predictions y_pred;
- Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y;
- Move the weights a little in the direction opposite to the gradient, for example \(W = W - \text{step} \times \text{gradient}\), which reduces the loss on the batch.

Stochastic means that each batch is drawn at random from the data.

One extreme variant of mini-batch SGD is batch gradient descent, which puts all the data into a single batch at every step: each update is more accurate, but far more expensive to compute.
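The four steps can be sketched in plain NumPy on a tiny linear model. Everything here is an assumption made for illustration: the synthetic data, the mean squared error loss, the step size, and the batch size; it is not the training code Keras runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for this sketch): y = 2*x1 - 3*x2 + 1
X = rng.normal(size=(1000, 2))
true_W = np.array([2.0, -3.0])
Y = X @ true_W + 1.0

W = rng.normal(scale=0.01, size=2)  # random initialization
b = 0.0
step = 0.1                          # learning rate
batch_size = 32

for _ in range(500):
    # 1. Draw a random batch of samples x and targets y.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    x, y = X[idx], Y[idx]

    # 2. Forward pass: predictions on the batch.
    y_pred = x @ W + b

    # 3. Loss on the batch: mean squared error between y_pred and y.
    error = y_pred - y
    loss = np.mean(error ** 2)

    # 4. Gradient of the loss, then move the weights against it.
    grad_W = 2.0 * (x.T @ error) / batch_size
    grad_b = 2.0 * np.mean(error)
    W -= step * grad_W
    b -= step * grad_b

print(W, b, loss)
```

Because the synthetic targets contain no noise, the learned W and b should end up very close to the values used to generate the data.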

Summary

- Learning means finding a set of weights that minimizes the loss function on the training data;
- Learning process: compute the gradient of the loss with respect to the weights on a small batch of data, then move the weights in the direction opposite to the gradient;
- The learning process is possible because a neural network is a chain of differentiable tensor operations, so the chain rule of derivatives can be used to compute the gradient of the loss with respect to the weights;
- Two important concepts: the loss function and the optimization method (both must be defined before data is fed into the network);
- Loss function: the function minimized during training, which can be used to evaluate the quality of the model (the smaller the better; the minimum is 0);
- Optimization method: the specific way the gradient is used to update the weights, for example RMSprop, SGD with momentum, and so on.
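The chain-rule point in the summary can be checked numerically on the smallest possible model. The sketch below assumes a one-weight linear model with a squared-error loss and arbitrary input values; the function names are hypothetical. The gradient derived with the chain rule is compared against a finite-difference estimate.

```python
def loss_fn(w, b, x, y):
    """Squared error of a one-weight linear model: (w * x + b - y) ** 2."""
    return (w * x + b - y) ** 2

def grad_w_chain_rule(w, b, x, y):
    """d(loss)/dw by the chain rule: d(u**2)/du * du/dw, with u = w*x + b - y."""
    u = w * x + b - y
    return 2.0 * u * x

w, b, x, y = 0.5, 0.1, 3.0, 2.0   # arbitrary illustrative values

analytic = grad_w_chain_rule(w, b, x, y)

# Central finite difference as an independent numerical check.
eps = 1e-6
numeric = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)
```

Frameworks apply the same rule automatically across every operation in the network, which is what makes gradient-based training practical.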

[Deep-learning-with-python] The mathematical basis of neural networks