Begin
These tutorials are not a machine learning course for undergraduate or graduate students, but rather a quick conceptual introduction. To follow the remaining tutorials, you need to download the datasets mentioned in this chapter.
Download
You can download the relevant files from the page of each learning algorithm. If you want to download all of these files at once, you can clone the repository of this tutorial:
git clone https://github.com/lisa-lab/DeepLearningTutorials.git
Datasets
The MNIST Dataset
(mnist.pkl.gz)
The MNIST dataset is a collection of handwritten digits that contains 60000 images for training and 10000 images for testing. In papers similar to this tutorial, the common practice is to split the 60000 training images into a training set of 50000 and a validation set of 10000, the latter being used to select hyper-parameters such as the learning rate, the model size, and so on. All images have been normalized to a size of 28*28, with the digit centered in the image. In the original dataset, each pixel is represented by a value from 0 to 255, where 0 is black, 255 is white, and the values in between are shades of gray.
Here are some example digits from the MNIST dataset:
To make the dataset easier to use from Python, we have serialized (pickled) it. The serialized file consists of three lists: the training data, the validation data, and the test data. Each element of a list is a pair made up of an image and its corresponding label. An image is a 784-dimensional (28*28) NumPy array, and the label is a number between 0 and 9. The following code shows how to load this dataset.
import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
When using the dataset, we generally divide it into several minibatches (see the section on stochastic gradient descent). We encourage you to store the dataset in shared variables and to access it by minibatch index (with a fixed minibatch size). The reason for shared variables is to take full advantage of the GPU. There is a significant overhead (latency) when copying data to the GPU. If you copy data on request (copying each minibatch separately when it is needed), rather than through shared variables, the GPU code will be no faster than running on the CPU. With Theano shared variables, Theano can copy the entire dataset to the GPU in a single call when the shared variables are constructed. Afterwards, the GPU can access any minibatch by slicing the shared variable instead of copying it from CPU memory, which avoids the transfer latency. Because the data and the labels have different formats (labels are usually integers while the data are real numbers), we suggest using separate shared variables for data and labels. We also suggest using separate shared variables for the training set, the validation set, and the test set (which results in 6 shared variables in total).
Since the data now live in single variables, a minibatch is just a slice of those variables, so it is natural to define a minibatch by its index and a fixed size. In our setup the batch size stays constant throughout execution, so a minibatch can be accessed by its index alone. The code below shows how to store the data and how to access a minibatch.
def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch everytime
    it is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as index, and if they are
    # floats it doesn't make sense) therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set
data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
On the GPU, the data has to be stored as floats (theano.config.floatX). Since the labels are integers, we work around this by storing them as floats as well and casting them back to int whenever they are used.
Attention
If you run the code on the GPU and the dataset is too large to fit in GPU memory, the code will crash. In that case you can still store the data in shared variables, but only a chunk of it (several minibatches) at a time: once a chunk has been used during training, replace it with the next chunk. This way you minimize the number of data transfers between CPU and GPU memory.
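A minimal sketch of this chunk-swapping idea, assuming a hypothetical load_chunk(i) helper that returns the i-th slice of the dataset as NumPy arrays and a known n_chunks; these names are not part of the tutorial's code:

import numpy
import theano

chunk_x, chunk_y = load_chunk(0)   # hypothetical loader, returns numpy arrays
shared_x = theano.shared(numpy.asarray(chunk_x, dtype=theano.config.floatX))
shared_y = theano.shared(numpy.asarray(chunk_y, dtype=theano.config.floatX))

for chunk_index in range(1, n_chunks):
    # ... iterate over the minibatches inside the current chunk here ...

    # swap in the next chunk: a single CPU->GPU transfer per chunk,
    # instead of one transfer per minibatch
    next_x, next_y = load_chunk(chunk_index)
    shared_x.set_value(numpy.asarray(next_x, dtype=theano.config.floatX))
    shared_y.set_value(numpy.asarray(next_y, dtype=theano.config.floatX))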
Notation
Dataset Notation
We write a dataset as $\mathcal{D}$. When the distinction matters, we write the training set, the validation set and the test set as $\mathcal{D}_{train}$, $\mathcal{D}_{valid}$ and $\mathcal{D}_{test}$. The validation set is used to select models and hyper-parameters, and the test set is used to evaluate the final generalization performance and to compare different algorithms in an unbiased way.
These tutorials mostly deal with classification problems, where each dataset $\mathcal{D}$ is a set of pairs $(x^{(i)}, y^{(i)})$. We use superscripts to distinguish the individual samples: $x^{(i)} \in \mathcal{R}^D$ is the $i$-th training sample of dimensionality $D$. Similarly, $y^{(i)} \in \{0, \dots, L\}$ is the label corresponding to the $i$-th sample $x^{(i)}$. It is straightforward to extend these samples to other forms of $y^{(i)}$ (such as Gaussian process regression or Gaussian mixture models).
Math Conventions
- $W$: unless otherwise noted, an upper-case symbol denotes a matrix
- $W_{ij}$: the element in row $i$ and column $j$ of matrix $W$
- $W_{i \cdot}$: the vector formed by row $i$ of matrix $W$
- $W_{\cdot j}$: the vector formed by column $j$ of matrix $W$
- $b$: unless otherwise noted, a lower-case symbol denotes a vector
- $b_i$: the $i$-th element of vector $b$
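As a small aside, the same notation maps directly onto NumPy indexing; a tiny illustrative example (the values are made up):

import numpy

W = numpy.arange(12).reshape(3, 4)   # a 3x4 matrix
print(W[1, 2])    # element in row i=1, column j=2
print(W[1, :])    # row i=1 as a vector
print(W[:, 2])    # column j=2 as a vector

b = numpy.array([5., 6., 7.])
print(b[0])       # first element of the vector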
List of Symbols and Abbreviations
- $D$: the number of dimensions of the input data
- $D_h^{(i)}$: the number of hidden units in the $i$-th layer
- $f_{\theta}(x)$, $f(x)$: the classification function associated with a model $P(Y|x,\theta)$, defined as $\mathrm{argmax}_k P(Y=k|x,\theta)$. Note that we often omit the subscript $\theta$.
- $L$: the number of labels
- $\mathcal{L}(\theta, \mathcal{D})$: the log-likelihood of the model defined by parameters $\theta$ on dataset $\mathcal{D}$
- $\ell(\theta, \mathcal{D})$: the empirical loss of the prediction function $f$, parameterized by $\theta$, on dataset $\mathcal{D}$
- NLL: the negative log-likelihood
- $\theta$: the set of all parameters of a given model
Python namespaces
The tutorials often use the following namespaces:
import theano
import theano.tensor as T
import numpy
Get started with supervised deep learning
In deep learning, unsupervised pre-training of deep networks has been widely used, but supervised learning still plays an important role. The usefulness of unsupervised pre-training is often evaluated after supervised fine-tuning. This section briefly reviews the supervised learning model for classification problems and covers the stochastic gradient descent algorithm used for fine-tuning throughout these tutorials. See the section on gradient descent learning for more details.
Learning a Classifier: the 0-1 Loss
Deep learning models are most often used for classification. The goal when training such a classifier is to minimize the number of errors on unseen samples. If $f: R^D \rightarrow \{0, \dots, L\}$ is the prediction function, then the 0-1 loss can be written as:
$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$
Here $\mathcal{D}$ is either the training set (during training) or a set disjoint from the training set (to obtain an unbiased estimate of validation or test performance). $I$ is the indicator function, defined as:
$I_f = \begin{cases} 1 & \text{if } f \text{ is true} \\ 0 & \text{otherwise} \end{cases}$
In this tutorial, $f$ is defined as:
$f(x) = \mathrm{argmax}_k P(Y=k \mid x, \theta)$
In Python, using Theano, this can be written as:
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))
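The snippet above assumes a symbolic matrix p_y_given_x of class probabilities. As a hedged illustration of where such a matrix might come from, here is a sketch using a simple softmax model; the names W, b, n_in and n_out are assumptions of this example, not part of the tutorial's code:

import numpy
import theano
import theano.tensor as T

x = T.matrix('x')      # minibatch of inputs, one example per row
y = T.ivector('y')     # corresponding integer labels

n_in, n_out = 28 * 28, 10
W = theano.shared(numpy.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros((n_out,), dtype=theano.config.floatX), name='b')

# class-membership probabilities, one row per example
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)

# number of misclassified examples in the minibatch
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))

# compile the symbolic expression into a callable function
errors = theano.function([x, y], zero_one_loss)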
The Negative Log-Likelihood Loss
Because the 0-1 loss is not differentiable, optimizing it directly for large models with thousands or even millions of parameters is prohibitively expensive. We therefore maximize the log-likelihood of the classifier instead:
$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} \mid x^{(i)}, \theta)$
The likelihood of the correct class is not the same thing as the number of correct predictions, but from the point of view of a randomly initialized classifier they are quite similar. Keep in mind, however, that the likelihood and the 0-1 loss are different objectives; you should watch their correlation on the validation data, as sometimes one improves while the other gets worse, and vice versa.
Since we usually speak of minimizing a loss function, learning is framed as minimizing the negative log-likelihood (NLL), defined as:
$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} \mid x^{(i)}, \theta)$
The negative log-likelihood of our classifier is in effect a differentiable surrogate for the 0-1 loss, so we can use its gradient over the training data to train the classifier. The corresponding code is as follows:
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
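To make the comment on syntax concrete, here is a small NumPy-only illustration of the same indexing trick (the numbers are made up for the example):

import numpy

M = numpy.array([[0.1, 0.7, 0.2],
                 [0.8, 0.1, 0.1],
                 [0.3, 0.3, 0.4]])
y = numpy.array([1, 0, 2])          # "correct label" for each row

# picks M[0, 1], M[1, 0], M[2, 2] -> array([0.7, 0.8, 0.4])
picked = M[numpy.arange(y.shape[0]), y]
print(picked)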
Stochastic Gradient Descent
What is ordinary gradient descent? It is a simple algorithm that, given a loss function defined over some parameters, repeatedly moves the parameters a small step downhill on the error surface. Gradient descent thus sees the training data only through the loss function. The corresponding pseudo-code is as follows:
# GRADIENT DESCENT

while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
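As a concrete (if trivial) instance of this pseudo-code, the following toy example minimizes the quadratic loss f(p) = (p - 3)^2, whose gradient is known in closed form; it is only meant to show the update rule in action:

learning_rate = 0.1
params = 0.0                      # initial guess

for step in range(100):           # fixed step budget instead of a stopping test
    loss = (params - 3.0) ** 2
    d_loss_wrt_params = 2.0 * (params - 3.0)
    params -= learning_rate * d_loss_wrt_params

print(params)                     # converges towards 3.0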
Stochastic gradient descent (SGD) follows the same principle, but it estimates the gradient from only a few training examples at a time (in the simplest case, a single example), which makes each step much faster. The corresponding pseudo-code is as follows:
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
                            # imagine an infinite generator
                            # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
For deep learning we recommend a further variant of stochastic gradient descent: minibatch stochastic gradient descent (minibatch SGD). In minibatch SGD, we use several training examples at a time to estimate the gradient. This technique reduces the variance of the gradient estimate and makes better use of the hierarchical memory organization of modern computer architectures.
for (x_batch, y_batch) in train_batches:
                            # imagine an infinite generator
                            # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
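One way (among many) to produce the train_batches iterated over above is to slice NumPy arrays into fixed-size pieces; a minimal sketch, assuming train_x and train_y are already loaded as arrays:

batch_size = 20

def iterate_minibatches(train_x, train_y, batch_size):
    # yield successive (inputs, labels) slices of size batch_size
    n_batches = train_x.shape[0] // batch_size
    for index in range(n_batches):
        lo, hi = index * batch_size, (index + 1) * batch_size
        yield train_x[lo:hi], train_y[lo:hi]

# for (x_batch, y_batch) in iterate_minibatches(train_x, train_y, batch_size):
#     ...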
There is a tradeoff in choosing the minibatch size. The reduction in variance and the use of SIMD instructions help most when going from a size of 1 to 2, but the marginal benefit quickly shrinks to almost nothing. With very large minibatches, time is wasted reducing the variance of the gradient estimate that would be better spent on additional gradient steps. The optimal size is model-dependent, data-dependent, and hardware-dependent, and can be anything from 1 to several hundred. In this tutorial we set it to 20, which is a fairly arbitrary (but harmless) choice.
Attention
If you train for only a fixed number of epochs, the minibatch size becomes important, because it controls the number of parameter updates. Training the same model for 10 epochs with a minibatch size of 1 gives very different results from training it for 10 epochs with a minibatch size of 20. Keep this in mind when switching between minibatch sizes.
All of the above is pseudo-code describing the algorithm. The actual implementation in Theano looks like this:
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print 'Current loss is ', MSGD(x_batch, y_batch)
    if stopping_condition_is_met:
        return params
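For the snippet above to compile, loss, params, x_batch and y_batch need symbolic definitions. A hedged sketch of one possible set of definitions, reusing the softmax idea from earlier; the names and shapes here are assumptions for illustration, not the tutorial's own code:

import numpy
import theano
import theano.tensor as T

x_batch = T.matrix('x_batch')    # a minibatch of inputs
y_batch = T.ivector('y_batch')   # the corresponding integer labels

n_in, n_out = 28 * 28, 10
# treat a single weight matrix as "params" for simplicity (no bias term)
params = theano.shared(numpy.zeros((n_in, n_out), dtype=theano.config.floatX),
                       name='params')

p_y_given_x = T.nnet.softmax(T.dot(x_batch, params))
loss = -T.mean(T.log(p_y_given_x)[T.arange(y_batch.shape[0]), y_batch])
learning_rate = 0.1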
Regularization
There is more to machine learning than optimization. The goal of training a model is to obtain good performance on new samples, not on the ones it has already seen. The stochastic gradient descent loop above does not take this into account and may overfit the training samples. One effective way to combat overfitting is regularization. There are many different options; here we discuss L1/L2 regularization and early stopping.
L1 and L2 Regularization
L1 and L2 regularization add an extra term to the loss function that penalizes certain parameter configurations. Formally, if the loss function is:
$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} \mid x^{(i)}, \theta)$
then the regularized loss is:
$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$
or, in our case:
$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p$
where
$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{1/p}$
is the $L_p$ norm of $\theta$. $\lambda$ is a hyper-parameter that controls the relative importance of the regularization term. The values most commonly used for $p$ are 1 and 2, giving the L1 and L2 norms. When $p = 2$, the regularizer is also known as weight decay.
In practice, adding a regularization term encourages smoother mappings (by penalizing large parameter values, it reduces the amount of non-linearity the network can express). More intuitively, the two terms (NLL and $R(\theta)$) correspond to modeling the data well and to preferring a "simple" or smooth solution. Following the principle of Occam's razor, minimizing their sum should find the simplest solution that still fits the data.
Note that a "simple" solution does not automatically generalize well. Empirically, however, adding a regularization term tends to improve the generalization performance of the network, especially on small datasets. The following code snippet shows how to compute the loss in Python with both an L1 and an L2 term:
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
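When a model has more than one parameter variable, the penalty terms are usually summed over all of the weight matrices; a sketch, assuming hypothetical shared variables W1 and W2 of a two-layer model (biases are commonly left out of the penalty):

# accumulate the regularization terms over a list of weight matrices;
# W1, W2, NLL, lambda_1 and lambda_2 are assumed to be defined elsewhere
params_list = [W1, W2]

L1 = sum(T.sum(abs(p)) for p in params_list)
L2_sqr = sum(T.sum(p ** 2) for p in params_list)

loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr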
Early Stopping
Early stopping combats overfitting by monitoring the model's performance on a validation set. The validation set is a collection of samples that we never use for gradient descent, but which is also not part of the test set. The validation samples are considered representative of future test samples. We can use them during training precisely because they belong to neither the training set nor the test set. If the model's performance stops improving on the validation set, or starts to degrade, the heuristic is to give up on further optimization.
Deciding when to stop is a judgment call and several heuristics exist, but these tutorials use a strategy based on a geometrically increasing amount of patience.
# early-stopping parameters
patience = 5000                # look at this many examples regardless
patience_increase = 2          # wait this much longer when a new best is
                               # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2)
                               # go through this many
                               # minibatches before checking the network
                               # on the validation set; in this case we
                               # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # report "1" for the first epoch, "n_epochs" for the last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
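A hedged sketch of how this_validation_loss might be computed in practice, assuming a compiled Theano function validate_model(index) that returns the zero-one loss of one validation minibatch, and a known n_valid_batches (both names are assumptions of this example):

# average the zero-one loss over all validation minibatches
validation_losses = [validate_model(i) for i in xrange(n_valid_batches)]
this_validation_loss = numpy.mean(validation_losses)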
If we run through all batches of training data before running out of patience, we simply go back to the beginning of the training set and repeat.
Attention
validation_frequency should always be smaller than patience. The code should check performance on the validation set at least twice before running out of patience. This is why we use the formulation validation_frequency = min(value, patience/2).
Attention
This algorithm could possibly be improved by using a test of statistical significance rather than a simple comparison when deciding whether to increase the patience.
Test
At the end of the loop, the best_params variable refers to the model that performed best on the validation set. If we repeat this procedure for another model class, or even for another random initialization, we should use the same train/validation/test split and will obtain other best-performing models. If we have to choose the best model class or initialization, we compare best_validation_loss across the candidates. Once we have chosen the final model, we report its performance on the test set. That is the performance we expect on unseen samples.
Recap
That is the heart of the optimization section. Early stopping requires us to split the samples into three sets (training, validation, and test). The training set is used to minimize the objective function with stochastic gradient descent. As training progresses, we periodically consult the validation set to check whether the model is really improving. Each time a new best performance on the validation set is observed, we save a copy of the model. When no better model has appeared for a long time, we abandon the search, and finally we report the test-set performance of the best saved model.
Theano/Python Tips: Loading and Saving Models
When you run experiments, it can take hours (and sometimes days) of gradient descent to find the best parameters. Once found, you will want to save those weights. You may also want to save the current best estimates while the search is still in progress.
Pickle the NumPy ndarrays from your shared variables
The best way to save your model parameters is to pickle or deepcopy the ndarray objects. For example, if your parameters are stored in the shared variables w, v, u, you can dump their values to a file with cPickle and reload them later. Reloading saved values into the shared variables looks like this:
>>> save_file = open('path')
>>> w.set_value(cPickle.load(save_file), borrow=True)
>>> v.set_value(cPickle.load(save_file), borrow=True)
>>> u.set_value(cPickle.load(save_file), borrow=True)
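For completeness, a sketch of the matching save step: pull the values out of the shared variables with get_value and write them with cPickle.dump (the file name 'path' simply mirrors the snippet above):

>>> import cPickle
>>> save_file = open('path', 'wb')  # overwrites the file at 'path'
>>> cPickle.dump(w.get_value(borrow=True), save_file, -1)  # -1 selects the highest
>>> cPickle.dump(v.get_value(borrow=True), save_file, -1)  # pickle protocol, which is
>>> cPickle.dump(u.get_value(borrow=True), save_file, -1)  # more compact for ndarrays
>>> save_file.close()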
This technique is a bit verbose, but it is tried and true. You will be able to load your saved parameters and render them with matplotlib years from now.
Do not pickle your training or test functions for long-term storage
Theano functions are compatible with Python's deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update Theano, or its internals change, you may not be able to unpickle a previously saved model. Theano is still under active development, and its internal APIs are subject to change. To be on the safe side, do not pickle your entire training or test functions for long-term storage. The pickle mechanism is meant for short-term storage, such as a temporary file or a copy sent to another machine in a distributed job.
Plotting Intermediate Results
Visualization is a powerful tool for understanding what a model or a training algorithm is doing. You might be tempted to insert matplotlib plotting commands or PIL image-rendering commands directly into your training script. However, later on you will notice something interesting in one of those pre-rendered images and want to investigate something that is not clear from the pictures, and you will wish you had saved the original models.
If you have enough disk space, your training script should save the intermediate models, and a separate visualization script should process those saved models.
You already have a model-saving function, right? Just use it again to save these intermediate models.
Libraries you may want to know about: Python Imaging Library (PIL), matplotlib.
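As an illustration of the kind of separate visualization script meant here, a minimal sketch that loads a pickled weight matrix and displays its columns as images; the file name 'params.pkl', the (784, 10) shape and the 28x28 reshape are assumptions of this example:

import cPickle
import matplotlib.pyplot as plt

with open('params.pkl', 'rb') as f:
    W = cPickle.load(f)              # assume a (784, 10) weight matrix

for k in range(10):                  # one panel per output class
    plt.subplot(2, 5, k + 1)
    plt.imshow(W[:, k].reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.show()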
References:
1. DeepLearning 0.1 Documentation
2. Theano Learning Guide 1 (translation)
3. OpenDL (don't forget to give it a star)