/* author:cyh_24 */
/* date:2014.10.2 */
/* Email: [Email protected] */
/* more:http://blog.csdn.net/cyh_24 */
Recently my studies have focused on images, and with quite a few competitions coming up, I decided to study deep learning so as not to hold things back too much, mainly using Theano's official Deep Learning Tutorial as a reference.
This series of blog posts will be updated continuously. I hope you will offer plenty of guidance so that we can learn together!
Reprint Please specify: http://blog.csdn.net/cyh_24/article/details/41827691
1. Download code and data
Theano officially provides all of the code and datasets needed for this tutorial.
You can clone them from GitHub:
git clone git://github.com/lisa-lab/deeplearningtutorials.git
2. The Datasets
2.1 The MNIST Dataset (mnist.pkl.gz)
The MNIST dataset consists of images of handwritten digits, divided into 60,000 training samples and 10,000 test samples.
In many papers, as in this tutorial, the official 60,000 training samples are further split into 50,000 training samples and 10,000 validation samples.
Validation samples are used to select hyper-parameters such as the learning rate and model size.
All of the handwritten digit images are normalized to a fixed 28*28 grayscale format.
Here are some examples of MNIST handwritten digits:
2.2 Using the Datasets
You can load the data with the following code:
import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
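As a quick sanity check, you can inspect the shapes of the three splits. This is a small sketch of my own (not part of the tutorial), assuming the standard mnist.pkl.gz layout where each split is a (data, labels) pair of numpy arrays:

# Quick sanity check of the loaded splits (assumes each split is a
# (data, labels) pair of numpy arrays, as in the standard mnist.pkl.gz).
train_x, train_y = train_set
valid_x, valid_y = valid_set
test_x, test_y = test_set
print(train_x.shape, train_y.shape)   # expected: (50000, 784) (50000,)
print(valid_x.shape, valid_y.shape)   # expected: (10000, 784) (10000,)
print(test_x.shape, test_y.shape)     # expected: (10000, 784) (10000,)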
It is important to note that when we use these datasets, we usually split them into smaller chunks called minibatches. Also, after reading the data, the official tutorial recommends storing it in shared variables. The reason for using shared variables is related to the GPU: copying data from CPU memory into GPU memory is a major efficiency bottleneck. If shared variables are not used and the data is copied minibatch by minibatch, the GPU code will be no faster than the CPU code (it may even be slower) because of this bottleneck.
Here's how to use shared variables:
def shared_dataset(data_xy):
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    return shared_x, T.cast(shared_y, 'int32')

# Use the shared variables
test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# Accessing the third minibatch of the training set
data = train_set_x[2 * 500: 3 * 500]
label = train_set_y[2 * 500: 3 * 500]
Note:
- If you run on the GPU and the dataset is too large to fit in GPU memory, the code will crash. In that case you should still store the data in shared variables, but in chunks.
- During training, keep a chunk that is small enough (several minibatches) in the shared variable. Once you have worked through that chunk, update the shared variable with the next one (see the sketch below).
- This approach minimizes the overhead of data transfer between CPU and GPU memory.
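A minimal sketch of that chunked loading, assuming a hypothetical helper get_next_chunk() that returns the next chunk as a numpy array; set_value is Theano's standard way to replace the contents of an existing shared variable:

# Sketch: load the next chunk of training data into an existing shared variable.
# get_next_chunk() is a hypothetical helper; set_value() replaces the array
# stored in the shared variable without rebuilding the computation graph.
chunk_x = get_next_chunk()   # a numpy array holding several minibatches
train_set_x.set_value(numpy.asarray(chunk_x, dtype=theano.config.floatX))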
3. Naming Conventions
3.1 Dataset notation
D_train, D_valid, and D_test denote the training set, the validation set, and the test set, respectively.
Each dataset consists of pairs (x, y), where x is the input data and y is the corresponding label.
3.2 Mathematical Definitions
3.3 Symbols and abbreviations
4. A Primer on Supervised Learning for Deep Learning
The most exciting aspect of deep learning is its heavy use of unsupervised learning to train neural networks. However, supervised learning still plays an important role: in general, unsupervised pre-training achieves better results after supervised fine-tuning. This introduction to supervised learning covers the following topics:
- a brief look at the role of supervised learning in classification models;
- an introduction to the minibatch stochastic gradient descent algorithm.
4.1 Learning a Classifier
4.1.1 Zero-One Loss
The goal of training a classifier is to minimize the number of classification errors (the zero-one loss) on unseen examples.
If the prediction function is:
Then the cost of its error is:
Here,D is the training set, the meaning ofI is as follows:
When x is true, it is 1; otherwise 0
In this tutorial,f (Predictive functions) is defined as follows:
We can express this in Theano as follows:
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))
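To make this concrete, here is a small self-contained sketch of my own (not from the tutorial) that compiles the expression and evaluates it on toy arrays; p_y_given_x is assumed to be a matrix of per-class probabilities, so axis=1 is passed to argmax:

import numpy
import theano
import theano.tensor as T

# Toy sketch: count misclassified examples from a matrix of class probabilities.
p_y_given_x = T.matrix('p_y_given_x')   # one row of class probabilities per example
y = T.ivector('y')                      # true class labels
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))

count_errors = theano.function([p_y_given_x, y], zero_one_loss)

probs = numpy.array([[0.1, 0.9],
                     [0.8, 0.2],
                     [0.3, 0.7]], dtype=theano.config.floatX)
labels = numpy.array([1, 0, 0], dtype='int32')
print(count_errors(probs, labels))      # prints 1: only the third example is wrong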
4.1.2 Log-Likelihood Loss (negative log-likelihood cost function)
Because the zero-one loss is not differentiable, optimizing it directly in a large model (with thousands of parameters) is prohibitively expensive. Therefore, we maximize the log-likelihood of the classifier instead:
L(theta, D) = sum_i log P(Y = y_i | x_i, theta)
Equivalently, we minimize the negative log-likelihood (NLL):
NLL(theta, D) = - sum_i log P(Y = y_i | x_i, theta)
Again, the corresponding Theano code is simple:
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
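The indexing expression T.log(p_y_given_x)[T.arange(y.shape[0]), y] may look cryptic; the following plain-numpy sketch (my own illustration, not from the tutorial) shows what that indexing does:

import numpy

# M[numpy.arange(n), y] selects M[0, y[0]], M[1, y[1]], ..., i.e. the
# log-probability assigned to the correct class of each example.
log_p = numpy.log(numpy.array([[0.1, 0.9],
                               [0.8, 0.2],
                               [0.3, 0.7]]))
y = numpy.array([1, 0, 0])
picked = log_p[numpy.arange(y.shape[0]), y]   # log(0.9), log(0.8), log(0.3)
nll = -numpy.sum(picked)
print(nll)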
4.2 Stochastic Gradient Descent
Ordinary gradient descent
Gradient descent uses the negative gradient direction as the search direction at each iteration, so that the objective function being optimized decreases step by step.
Pseudo-code for ordinary gradient descent:
# GRADIENT DESCENT
while True:
    loss = f(params)
    d_loss_wrt_params = ...  # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
Stochastic gradient descent
Stochastic gradient descent (SGD) updates the parameters from one sample per iteration, rather than from the entire training set. When the training set is very large (say, hundreds of thousands of samples), theta may get close to the optimal solution after only a few thousand or tens of thousands of samples.
However, one problem with SGD is noise: not every iteration moves in the direction of the global optimum.
The SGD pseudo-code is as follows:
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ...  # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
Minibatch stochastic gradient descent
The Theano deep learning tutorial recommends a further variant of stochastic gradient descent called minibatch stochastic gradient descent (minibatch SGD, or MSGD). Minibatch SGD works in the same way as SGD, except that more than one training sample is used to estimate the gradient at each step. This effectively reduces the variance of the gradient estimate, and it also makes better use of the computer's memory.
The MSGD pseudo code is as follows:
for (x_batch, y_batch) in train_batches:
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ...  # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
Note:
The batch size B is a trade-off that depends on the model, the dataset, the hardware, and so on. The tutorial uses 20, but you can choose it fairly freely (the choice does not affect the correctness of the results).
However, if the number of training iterations is fixed, the batch size will affect the results.
Below, we use Theano to write a complete MSGD:
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
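When the data lives in shared variables (Section 2.2), the usual pattern in the official tutorial is to pass only a minibatch index and let givens slice the shared data inside the graph. Here is a sketch under those assumptions: x and y are assumed to be the symbolic inputs that loss was built from, and loss, params, learning_rate, and batch_size are assumed to exist already.

# Sketch: index-based minibatch training when the data is stored in shared variables.
index = T.lscalar('index')   # minibatch index
d_loss_wrt_params = T.grad(loss, params)
updates = [(params, params - learning_rate * d_loss_wrt_params)]

train_model = theano.function(
    inputs=[index],
    outputs=loss,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
for minibatch_index in range(n_train_batches):
    minibatch_loss = train_model(minibatch_index)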
4.3 Regularization
Minibatch stochastic gradient descent updates the model incrementally, which makes it convenient to keep training an existing model on new samples. This is convenient, but it also leaves the possibility of overfitting. To prevent overfitting, we can use two methods:
- regularization;
- early stopping, i.e. ending training after a certain point.
4.3.1 Regularization
L1 and L2 regularization penalize certain parameter configurations by adding an extra penalty term to the cost function (this kind of regularization is also called weight decay).
In general, if our cost function is the negative log-likelihood NLL(theta, D) defined above,
then the regularized cost function is:
E(theta, D) = NLL(theta, D) + lambda * R(theta)
In our case, the regularized cost function is:
E(theta, D) = NLL(theta, D) + lambda * ||theta||_p^p
where ||theta||_p = ( sum_j |theta_j|^p )^(1/p), with p = 1 for L1 regularization and p = 2 for L2 regularization.
In principle, adding a regularization term to the cost function makes the mapping learned by the neural network smoother (by penalizing large parameters, it reduces the amount of nonlinearity the network models).
Here is the implementation in Python:
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the regularized loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
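When the model has several parameter tensors, the penalties are usually summed over all of them. A small sketch of my own follows; W and b are hypothetical shared variables for a weight matrix and a bias vector (in practice the bias is often left out of the penalty):

# Sketch: L1 / squared-L2 penalties accumulated over several parameter tensors.
# W and b are hypothetical model parameters (Theano shared variables).
params = [W, b]
L1 = sum(T.sum(abs(p)) for p in params)
L2_sqr = sum(T.sum(p ** 2) for p in params)
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr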
4.3.2 Early Stopping
Validation set: not used for gradient-based training, and not part of the test set either. You can think of it as a stand-in for the test set.
We can monitor the model's performance on the validation set to detect overfitting. If the model's performance on the validation set stops improving significantly, or even gets worse, we should stop iterating.
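As an illustration, here is a minimal early-stopping loop of my own, simpler than the patience-based version used in the official tutorial; train_one_epoch() and validate_model() are hypothetical helpers:

# Sketch: stop when the validation loss has not improved for max_bad_epochs epochs.
best_valid_loss = float('inf')
bad_epochs = 0
max_bad_epochs = 10

for epoch in range(1000):
    train_one_epoch()                 # hypothetical: one pass of MSGD over the data
    valid_loss = validate_model()     # hypothetical: mean loss on the validation set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        bad_epochs = 0                # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= max_bad_epochs:
            break                     # no improvement for a while: stop early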
5. Closing Remarks
Python is so elegant, Theano is such a nice library, and deep learning is so trendy; without saying more, I will also stop early...
Theano Deep Learning (i)----installation and use