Interest in machine learning has surged over the past decade. Almost every day, machine learning comes up in computer science courses, industry conferences, the Wall Street Journal, and elsewhere. In all of this discussion, many people confuse what machine learning can do with what they wish it could do. Fundamentally, machine learning uses algorithms to extract information from raw data and represent it in some type of model. We then use this model to infer things about other data that we have not yet modeled.

Neural networks are one type of machine learning model, and they have been around for at least 50 years. The basic unit of a neural network is a node, loosely inspired by the biological neuron in the mammalian brain. The connections between nodes are also modeled on biological brains, as is the way these connections develop over time through "training."

In the mid-1980s and early 1990s, many important architectural advances were made in neural networks. However, the amount of time and data required to achieve good results slowed adoption, which dampened researchers' interest. In the early 2000s, computing power grew exponentially, and the industry saw a "Cambrian explosion" of computing technology. Deep learning emerged from this explosive growth in computing power as a serious contender in the field, winning many important machine learning competitions. The trend has not slowed since; today, deep learning shows up in every corner of machine learning.

Recently, I started reading academic papers on deep learning. From my research, here are several publications that have had a huge influence on the development of the field:

● New York University's "Gradient-Based Learning Applied to Document Recognition" (1998), which introduced convolutional neural networks to the machine learning world.

● The University of Toronto's "Deep Boltzmann Machines" (2009), which presented a new learning algorithm for Boltzmann machines that contain many layers of hidden variables.

● Stanford and Google's "Building High-Level Features Using Large-Scale Unsupervised Learning" (2012), which addressed the problem of building high-level, class-specific feature detectors from only unlabeled data.

● Berkeley's "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition" (2013), which released DeCAF, an open-source implementation of the deep convolutional activation features, along with all associated network parameters, so that vision researchers can run in-depth experiments across a range of visual concept learning paradigms.

● DeepMind's "Playing Atari with Deep Reinforcement Learning" (2013), which presented the first deep learning model to learn control policies directly from high-dimensional sensory input using reinforcement learning.

By reading and studying these papers, I have learned a great deal about deep learning. Here, I want to share 10 powerful deep learning methods that AI engineers can apply to their machine learning problems. But first, let's define what deep learning is. Defining deep learning has been a challenge for many people because its form has changed gradually over the past decade. To put deep learning in context, the following diagram illustrates the relationship between artificial intelligence, machine learning, and deep learning.

The field of artificial intelligence is broad and has been around for a long time. Deep learning is a subset of machine learning, which in turn is a subfield of artificial intelligence. Several things distinguish deep learning networks from earlier feedforward multilayer networks:

● Deep learning networks have more neurons than earlier networks;

● There are more complex ways of connecting layers in deep learning;

● A "Cambrian explosion" of computing power is available for training;

● Deep learning can automatically perform feature extraction.

By "more neurons" I mean that neuron counts have risen over the years, allowing deep learning to express more complex models. Layers have also evolved from the fully connected layers of multilayer networks to locally connected patches of neurons in convolutional neural networks, and to recurrent connections back to the same neuron (in addition to the connections from the previous layer) in recurrent neural networks.

Deep learning can then be defined as neural networks with a large number of parameters and layers, built in one of the following four network architectures:

● Unsupervised pre-training network;

● Convolutional neural networks;

● Recurrent neural networks;

● Recursive neural networks.

In this article, I mainly focus on the latter three architectures. A convolutional neural network (CNN) is basically a standard neural network that has been extended across space using shared weights. A CNN is designed to recognize images through the convolutions inside it, which detect the edges of an object in the image. A recurrent neural network (RNN) is basically a standard neural network that has been extended across time: its edges feed into the next time step instead of into the next layer within the same time step. An RNN is designed to recognize sequences, such as speech signals or text, and the cycles inside it give the network a form of short-term memory. A recursive neural network is more like a hierarchical network: the input sequence has no real time aspect, but the input must be processed hierarchically, in a tree-like fashion. The following 10 methods can be applied to all of these architectures.
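The idea of shared weights across space and across time can be made concrete with a minimal NumPy sketch. The function names `conv1d` and `rnn_forward`, the one-dimensional input, and the random weights are all assumptions chosen for illustration; real CNN and RNN layers add biases, multiple channels, and learned parameters.

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide one shared kernel across the input (valid cross-correlation)."""
    width = len(kernel)
    return np.array([signal[i:i + width] @ kernel
                     for i in range(len(signal) - width + 1)])

def rnn_forward(inputs, W_in, W_rec):
    """Carry a hidden state from one time step to the next."""
    h = np.zeros(W_rec.shape[0])
    for x_t in inputs:
        # The same two weight matrices are reused at every time step.
        h = np.tanh(W_in @ x_t + W_rec @ h)
    return h

# Shared weights across space: a [-1, 1] kernel responds only at the edge
# of a step signal, wherever that edge happens to be.
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
edges = conv1d(signal, np.array([-1.0, 1.0]))
print(edges)  # [0. 0. 1. 0. 0.]

# Shared weights across time: an input seen early still influences the
# final state, and the result depends on where in the sequence it arrived.
rng = np.random.default_rng(0)
W_in, W_rec = rng.normal(size=(2, 1)), rng.normal(size=(2, 2))
seq = np.array([[1.0], [0.0], [0.0]])
early = rnn_forward(seq, W_in, W_rec)        # the 1.0 arrives first
late = rnn_forward(seq[::-1], W_in, W_rec)   # the 1.0 arrives last
print(early, late)
```

The convolution uses two weights no matter how long the signal is, and the recurrent loop uses the same matrices no matter how long the sequence is, which is what "extended across space" and "extended across time" amount to in this toy setting.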

1-Backpropagation

Backpropagation ("backprop") is simply a method for calculating the partial derivatives (the gradient) of a function that has the form of a function composition, as a neural network does. When you solve an optimization problem with a gradient-based method (gradient descent is just one of them), you need the gradient of the function at each iteration, and backpropagation is how you compute it.
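As a small worked example of differentiating a composition, consider a single neuron. The symbols here are illustrative assumptions rather than notation from the article: input x, weight w, bias b, sigmoid activation σ, target y, and a squared-error loss L.

```latex
\begin{aligned}
z &= w x + b, \qquad a = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad L = \tfrac{1}{2}(a - y)^2, \\
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial w}
   = (a - y)\,\sigma(z)\bigl(1 - \sigma(z)\bigr)\,x, \\
\frac{\partial L}{\partial b}
  &= \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial b}
   = (a - y)\,\sigma(z)\bigl(1 - \sigma(z)\bigr).
\end{aligned}
```

The first two factors are shared by both derivatives. Backpropagation exploits exactly this reuse: it computes the shared factors once, starting from the loss and working backward through the composition.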

For a neural network, the objective function has the form of a composition. How do you compute the gradient? There are two common ways: (i) Analytic differentiation. You know the form of the function, so you just compute the derivatives using the chain rule (basic calculus). (ii) Approximate differentiation using finite differences. This method is computationally expensive because the number of function evaluations is O(N), where N is the number of parameters, which is costly compared with analytic differentiation. Finite differences are, however, commonly used to validate a backpropagation implementation when debugging.
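A gradient check of this kind might look like the following sketch. The toy network (one tanh hidden layer, squared-error loss), the function names, and the step size `eps` are assumptions made for illustration, not a prescribed procedure.

```python
import numpy as np

def loss(params, x, y):
    """Squared error of a toy one-hidden-layer tanh network."""
    W1, w2 = params
    h = np.tanh(W1 @ x)   # hidden activations
    pred = w2 @ h         # scalar prediction
    return 0.5 * (pred - y) ** 2

def analytic_grad(params, x, y):
    """Gradient via the chain rule, i.e. backpropagation written by hand."""
    W1, w2 = params
    h = np.tanh(W1 @ x)
    pred = w2 @ h
    d_pred = pred - y              # dL/dpred
    g_w2 = d_pred * h              # dL/dw2
    d_h = d_pred * w2              # dL/dh
    d_pre = d_h * (1.0 - h ** 2)   # back through tanh
    g_W1 = np.outer(d_pre, x)      # dL/dW1
    return [g_W1, g_w2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences: two loss evaluations per parameter."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            up = loss(params, x, y)
            p[idx] = old - eps
            down = loss(params, x, y)
            p[idx] = old           # restore the parameter
            g[idx] = (up - down) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 0.7
params = [rng.normal(size=(4, 3)), rng.normal(size=4)]
max_diff = max(np.max(np.abs(a - n))
               for a, n in zip(analytic_grad(params, x, y),
                               numeric_grad(params, x, y)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

The numeric version loops over every parameter and re-evaluates the loss each time, which is the O(N) cost mentioned above. The analytic version needs only one forward and one backward pass, so the finite-difference result is best kept as a debugging reference.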

2-Stochastic gradient descent

An intuitive way to think of gradient descent is to imagine the path of a river originating at the top of a mountain. The goal of gradient descent is exactly what the river strives to achieve: to reach the lowest point at the foot of the mountain.

Now, if the terrain of the mountain is shaped so that the river never has to stop completely anywhere before arriving at its final destination, we have the ideal case. In machine learning, this amounts to saying that we have found the global minimum (or optimum) of the solution starting from the initial point (the top of the mountain). However, the terrain may contain pits along the river's path that force the river to become trapped and stagnate. In machine learning terms, such pits are called local minima, which are undesirable. There are many ways to deal with local minima, but I will not discuss them further here.

Gradient descent is therefore prone to getting stuck in a local minimum, depending on the nature of the terrain (the function, in ML terms). But when the mountain has a special shape (like a bowl, which in ML terms is called a convex function), the algorithm always finds the optimum. You can picture the river flowing straight to the bottom of the bowl. In machine learning, such special terrains (convex functions) are always desirable for optimization. In addition, depending on where at the top of the mountain you start (that is, the initial values of the function), you may take a completely different path to the bottom. Similarly, depending on how fast the river flows (that is, the learning rate or step size of the gradient descent algorithm), you may arrive at the destination in a different manner. Both of these factors affect whether you fall into a pit (local minimum) or avoid it.
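The river analogy can be tried out on two one-dimensional functions. The sketch below uses plain (full-gradient) descent rather than a stochastic variant, and the two functions, starting points, and learning rates are assumptions picked only to make the behavior visible.

```python
def gradient_descent(grad, x0, learning_rate, steps=200):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

def bowl_grad(x):
    """Gradient of the convex bowl f(x) = x^2, whose only minimum is x = 0."""
    return 2 * x

def bumpy_grad(x):
    """Gradient of f(x) = x^4 - 3x^2 + x, which has two valleys:
    a deeper one near x = -1.30 and a shallower one near x = 1.13."""
    return 4 * x ** 3 - 6 * x + 1

# Convex terrain: different starting points end at the same optimum.
bowl_a = gradient_descent(bowl_grad, x0=5.0, learning_rate=0.1)
bowl_b = gradient_descent(bowl_grad, x0=-3.0, learning_rate=0.1)

# Non-convex terrain: the starting point decides which valley is reached.
left = gradient_descent(bumpy_grad, x0=-2.0, learning_rate=0.01)
right = gradient_descent(bumpy_grad, x0=2.0, learning_rate=0.01)

# Same starting point, larger step size: the first step overshoots the
# shallow valley and the run settles in the deeper one instead.
fast = gradient_descent(bumpy_grad, x0=2.0, learning_rate=0.1)

print(bowl_a, bowl_b, left, right, fast)
```

Under these assumptions, both runs on the bowl reach the bottom, while on the bumpy function the outcome depends on both the initial value and the learning rate, as the paragraph above describes. A larger step does not help in general; here it simply happens to jump past the shallow valley.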

3-Learning rate decay

Adapting the learning rate of the stochastic gradient descent optimization procedure can improve performance and reduce training time. This is sometimes called learning rate annealing or an adaptive learning rate. The simplest and perhaps most widely used adaptation during training is to reduce the learning rate over time. A large learning rate at the beginning of training allows large changes to the weights; later in training, the learning rate is lowered so that the updates to the weights become smaller. The effect is to quickly learn reasonably good weights early on and fine-tune them later.

Two popular and easy-to-use learning rate decay schedules are as follows:

● Gradually reduce the learning rate at each step.

● Decrease the learning rate in large drops at specific epochs.
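The two schedules above can be sketched as small functions. The formulas below are one common way to write them, and the names `time_based_decay`, `step_decay`, `decay`, `drop`, and `epochs_per_drop` are hypothetical, chosen for this example only.

```python
def time_based_decay(initial_lr, decay, epoch):
    """Gradual schedule: the rate shrinks a little with every epoch."""
    return initial_lr / (1.0 + decay * epoch)

def step_decay(initial_lr, drop, epochs_per_drop, epoch):
    """Punctuated schedule: multiply the rate by `drop` once every
    `epochs_per_drop` epochs and hold it constant in between."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

# Compare the two schedules at a few epochs, starting from a rate of 0.1.
for epoch in (0, 5, 9, 10, 20):
    gradual = time_based_decay(0.1, decay=0.1, epoch=epoch)
    stepped = step_decay(0.1, drop=0.5, epochs_per_drop=10, epoch=epoch)
    print(epoch, round(gradual, 4), round(stepped, 4))
```

With these assumed settings, the gradual schedule has halved the rate by epoch 10, while the step schedule keeps it at 0.1 through epoch 9 and halves it exactly at epochs 10 and 20.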