A recurrent neural network (RNN) is a class of neural networks that includes weighted connections within a layer (compared with traditional feed-forward networks, where connections feed only to subsequent layers). Because RNNs include loops, they can store information while processing new input. This memory makes them ideal for processing tasks where prior inputs must be considered (such as time-series data). For this reason, many current deep learning networks are based on RNNs. This tutorial explores the ideas behind RNNs and implements one from scratch for sequence data prediction.
Neural networks are computational structures that map an input to an output based on a network of highly connected processing elements (neurons). For a quick primer on neural networks, you can read another of my tutorials, "A neural networks deep dive," which looked at perceptrons (the building blocks of neural networks) and multilayer perceptrons with back-propagation learning.
In the prior tutorial, I explored the feed-forward network topology. In this topology, shown in the following figure, you feed an input vector into the network through the hidden layers, and it eventually results in an output. In this network, the input maps to the output (every time the input is applied) in a deterministic way.
But, say you're dealing with time-series data. A single data point in isolation isn't entirely useful because it lacks important attributes (for example, is the data series changing? Growing? Shrinking?). Consider a natural language processing application in which letters or words represent the input. When it comes to understanding words, letters are important in context. These inputs aren't useful in isolation but only in the context of what occurred before them.
Applications of time-series data require a new type of topology that can consider the history of the input. This is where you can apply RNNs. An RNN includes the ability to maintain internal memory through feedback. In the following example, the hidden layer output is applied back into the hidden layer. The network remains feed-forward (inputs are applied to the hidden layer, and then the output layer), but the RNN maintains internal state through context nodes (which influence the hidden layer on subsequent inputs).
RNNs aren't a single class of network but rather a collection of topologies that apply to different problems. One interesting aspect of recurrent networks is that with enough layers and nodes, they are Turing complete, which means that they can implement any computable function.
Architectures of RNNs
RNNs were introduced in the 1980s, and their ability to maintain memory of past inputs opened new problem domains to neural networks. Let's explore a few of the architectures that you can use.
Hopfield
The Hopfield network is an associative memory. Given an input pattern, it retrieves the most similar pattern for that input. This association (connection between the input and output) mimics the operation of the human brain. Humans are able to fully recall a memory given only partial aspects of it, and a Hopfield network operates in a similar way.
Hopfield networks are binary in nature, with individual neurons either on (firing) or off (not firing). Each neuron connects to every other neuron through a weighted connection (see the following image). Each neuron serves as both an input and an output. At initialization, the network is loaded with a partial pattern, and then each neuron is updated until the network converges (which it is guaranteed to do). The output is provided on convergence (the state of the neurons).
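The following is a minimal sketch of that process for a tiny network of hypothetical size. It uses the common bipolar (+1/-1) formulation of the on/off states and Hebbian weight loading (discussed next); it is an illustration only, not code from this tutorial's repository.

#include <stdio.h>

#define N 8   /* hypothetical number of neurons */

/* Every neuron connects to every other neuron through a symmetric weight;
   states are bipolar (+1 firing, -1 not firing). */
static double w[ N ][ N ];
static int    s[ N ];

/* Hebbian learning: accumulate the outer product of each stored pattern. */
static void store_pattern( const int p[ N ] )
{
   int i, j;
   for ( i = 0 ; i < N ; i++ )
      for ( j = 0 ; j < N ; j++ )
         if ( i != j ) w[ i ][ j ] += (double) p[ i ] * p[ j ] / N;
}

/* Update neurons one at a time until no state changes (the network is
   guaranteed to converge); the final states are the recalled pattern.  */
static void recall( void )
{
   int changed = 1, i, j;

   while ( changed )
   {
      changed = 0;
      for ( i = 0 ; i < N ; i++ )
      {
         double sum = 0.0;
         int next;

         for ( j = 0 ; j < N ; j++ ) sum += w[ i ][ j ] * s[ j ];
         next = ( sum >= 0.0 ) ? 1 : -1;
         if ( next != s[ i ] ) { s[ i ] = next; changed = 1; }
      }
   }
}

int main( void )
{
   const int pattern[ N ] = { 1, -1, 1, -1, 1, -1, 1, -1 };
   const int partial[ N ] = { 1, -1, 1, -1, -1, -1, -1, -1 };  /* noisy copy */
   int i;

   store_pattern( pattern );                    /* learn one pattern          */
   for ( i = 0 ; i < N ; i++ ) s[ i ] = partial[ i ];
   recall();                                    /* converge to stored pattern */

   for ( i = 0 ; i < N ; i++ ) printf( "%2d ", s[ i ] );
   printf( "\n" );
   return 0;
}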
Hopfield networks are able to learn (through Hebbian learning) multiple patterns and converge to recall the closest pattern given the presence of noise in the inputs. Hopfield networks aren't suitable for time-domain problems but rather are recurrent in nature.
Simple recurrent networks
Simple recurrent networks are a popular class of recurrent networks that includes a state layer for introducing state into the network. The state layer influences the next stage of input and therefore can be applied to time-varying data.
You can apply statefulness in various ways, but two popular approaches are the Elman and Jordan networks (see the following image). In the case of the Elman network, the hidden layer feeds a state layer of context nodes that retain memory of past inputs. As shown in the following figure, a single set of context nodes exists that maintains memory of the prior hidden layer result. Another popular topology is the Jordan network. Jordan networks differ in that instead of maintaining history of the hidden layer, they store the output layer into the state layer, as the sketch below illustrates.
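The distinction is easiest to see in code. In a hypothetical implementation (the sizes and array names here are mine, not from this tutorial's source), the memory update performed after each feed-forward pass differs only in which layer gets copied:

#define HIDDEN  3     /* hypothetical layer sizes */
#define OUTPUTS 4

double hidden[ HIDDEN ], outputs[ OUTPUTS ];
double elman_context[ HIDDEN ];    /* Elman: remembers the hidden layer  */
double jordan_state[ OUTPUTS ];    /* Jordan: remembers the output layer */

/* Called after each feed-forward pass to refresh the network's memory. */
void update_memory( void )
{
   int i;

   /* Elman network: context nodes copy the prior hidden-layer result. */
   for ( i = 0 ; i < HIDDEN ; i++ ) elman_context[ i ] = hidden[ i ];

   /* Jordan network: state nodes copy the prior output-layer result.  */
   for ( i = 0 ; i < OUTPUTS ; i++ ) jordan_state[ i ] = outputs[ i ];
}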
Elman and Jordan networks can be trained through standard back-propagation, and each has been applied to sequence recognition and natural language processing. Note that a single state layer has been introduced here, but it's easy to see that you could add state layers where the state layer output acts as the input for a subsequent state layer. I explore this idea in the context of the Elman network later in this tutorial.
Other networks
Work on recurrent-style networks has not stopped, and today, recurrent architectures are setting the standard for operating on time-series data. The long short-term memory (LSTM) approach in deep learning has been used with convolutional networks to describe the content of images and videos in generated language. The LSTM includes a forget gate that lets you "train" individual neurons about what's important and how long it will remain important. LSTM can operate on data where important events can be separated by long periods of time.
Another recent architecture is called the gated recurrent unit (GRU). The GRU is an optimization of the LSTM that requires fewer parameters.
RNN training algorithms
RNNs have unique training algorithms because of their nature of incorporating historical information in time or sequence. Gradient descent algorithms have been successfully applied to RNN weight optimization (minimizing the error by adjusting each weight in proportion to the derivative of the error with respect to that weight). One popular technique is back-propagation through time (BPTT), which applies weight updates by summing the accumulated errors for each element in a sequence and then updating the weights at the end. For large input sequences, this behavior can cause weights to either vanish or explode (called the vanishing or exploding gradient problem). To combat this problem, hybrid approaches are commonly used in which BPTT is combined with other algorithms, such as real-time recurrent learning.
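To make the idea concrete, here is a tiny sketch of BPTT for a one-neuron linear RNN (h[t] = w*h[t-1] + u*x[t]) with the error measured only at the final step. This is my own illustration of the principle, not the training method used in this tutorial's example (which is trained with a genetic algorithm instead). Note how the backward pass repeatedly multiplies by w, which is exactly what makes gradients vanish or explode over long sequences.

#include <stdio.h>

/* BPTT on a one-neuron linear RNN: h[t] = w*h[t-1] + u*x[t], with the
   error E = 0.5*(h[T]-y)^2 measured only at the last step. Gradients
   are accumulated backward over the sequence, and the weights are
   updated once at the end.                                             */
int main( void )
{
   double x[] = { 0.5, -0.2, 0.8, 0.1 };      /* toy input sequence   */
   const int T = 4;
   double y = 0.3;                            /* desired final output */
   double w = 0.7, u = 0.4, rate = 0.05;
   int epoch, t;

   for ( epoch = 0 ; epoch < 200 ; epoch++ )
   {
      double h[ 5 ];                          /* h[0] is the initial state */
      double dw = 0.0, du = 0.0, delta;

      /* Forward pass over the whole sequence. */
      h[ 0 ] = 0.0;
      for ( t = 1 ; t <= T ; t++ ) h[ t ] = w * h[ t - 1 ] + u * x[ t - 1 ];

      /* Backward pass: propagate the error back through time. */
      delta = h[ T ] - y;                     /* dE/dh[T] */
      for ( t = T ; t >= 1 ; t-- )
      {
         dw += delta * h[ t - 1 ];            /* step t's contribution to dE/dw */
         du += delta * x[ t - 1 ];            /* step t's contribution to dE/du */
         delta *= w;                          /* becomes dE/dh[t-1]             */
      }

      /* A single weight update at the end of the sequence. */
      w -= rate * dw;
      u -= rate * du;
   }

   printf( "trained weights: w=%.3f u=%.3f\n", w, u );
   return 0;
}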
Other training methods can also be successfully applied to evolving RNNs. Evolutionary algorithms (such as genetic algorithms or simulated annealing) can be applied to evolve populations of candidate RNNs and then recombine them as a function of their fitness (that is, their ability to solve the given problem). Although not guaranteed to converge on a solution, they can be successfully applied to a range of problems, including RNN evolution.
One useful application of RNNs is the prediction of sequences. In the following example, I build an RNN that can predict the last letter of a word given a small vocabulary. I'll feed the word into the RNN, one letter at a time, and the output of the network will represent the predicted next letter.
Genetic algorithm flow
Before jumping into the RNN example, let's look at the process behind genetic algorithms. The genetic algorithm is an optimization technique inspired by the process of natural selection. As shown in the following figure, the algorithm creates a random population of candidate solutions (called chromosomes) that encode the parameters of the solution being sought. After they are created, the population is tested against the problem and a fitness value is assigned to each candidate. Parents are then identified from the population (with higher fitness being preferred), and a child chromosome is created for the next generation. During the child's generation, genetic operators are applied (such as taking elements from each parent [called crossover] and introducing random changes into the child [called mutation]). The process then begins again with the new population until a suitable candidate solution is found, as sketched below.
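The following is a minimal, self-contained sketch of that loop. To keep it runnable, it evolves 16-bit chromosomes toward a toy goal (maximizing the number of set bits) rather than RNN weights, but the flow of evaluation, selection, crossover, and mutation is the one described above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POP_SIZE 32
#define GENS     100

/* Fitness of a toy chromosome: the number of set bits. */
static int fitness( unsigned short c )
{
   int bits = 0;
   while ( c ) { bits += c & 1; c >>= 1; }
   return bits;
}

/* Tournament selection: prefer the fitter of two random candidates. */
static unsigned short select_parent( const unsigned short pop[] )
{
   unsigned short a = pop[ rand() % POP_SIZE ];
   unsigned short b = pop[ rand() % POP_SIZE ];
   return ( fitness( a ) > fitness( b ) ) ? a : b;
}

int main( void )
{
   unsigned short pop[ POP_SIZE ], next[ POP_SIZE ];
   int gen, i, best = 0;

   /* Random initial population of candidate solutions (chromosomes). */
   for ( i = 0 ; i < POP_SIZE ; i++ ) pop[ i ] = (unsigned short)( rand() & 0xffff );

   for ( gen = 0 ; gen < GENS ; gen++ )
   {
      for ( i = 0 ; i < POP_SIZE ; i++ )
      {
         unsigned short mom = select_parent( pop );
         unsigned short dad = select_parent( pop );
         int cut = rand() % 16;                           /* crossover point  */
         unsigned short mask  = (unsigned short)( ( 1u << cut ) - 1u );
         unsigned short child = (unsigned short)( ( mom & mask ) | ( dad & ~mask ) );

         if ( ( rand() % 100 ) < 5 )                      /* 5% mutation rate */
            child ^= (unsigned short)( 1u << ( rand() % 16 ) );

         next[ i ] = child;
      }
      memcpy( pop, next, sizeof( pop ) );                 /* new generation   */
   }

   for ( i = 1 ; i < POP_SIZE ; i++ )
      if ( fitness( pop[ i ] ) > fitness( pop[ best ] ) ) best = i;

   printf( "best fitness after %d generations: %d of 16 bits set\n", GENS, fitness( pop[ best ] ) );
   return 0;
}

Tournament selection is used here only for simplicity; the repository's ga.c may select parents differently.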
Representing neural networks in a population of chromosomes
A chromosome is defined as an individual member of the population and contains an encoding for the particular problem to be solved. In the context of evolving an RNN, the chromosome is made up of the weights of the RNN, as shown in the following figure.
Each chromosome contains a 16-bit value per weight. The value, in the range 0-65535, is converted to a weight by subtracting half of the range and then multiplying the result by 0.001. This means that the encoding can represent values in the range -32.767 to 32.768 in increments of 0.001.
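That conversion is a one-liner; the function name below is mine, not necessarily the one used in the repository:

/* Decode a 16-bit gene (0-65535) into a weight: subtract half of the
   range, then scale by 0.001, giving -32.767 to 32.768 in 0.001 steps. */
static double decode_weight( unsigned short gene )
{
   return ( (double) gene - 32767.0 ) * 0.001;
}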
The process of taking a chromosome from the population and generating an RNN is simply defined as initializing the weights of the network with the translated weights from the chromosome. In this example, this represents 233 individual weights.
Letter prediction with RNNs
Now, let's explore the application of letters to a neural network. Neural networks operate on numerical values, so some representation is required to feed a letter into a network. For this example, I use one-hot encoding. One-hot encoding converts a letter into a vector in which a single element of the vector is set. This encoding creates a distinct feature that can be used mathematically (for example, each letter represented gets its own weight applied within the network). While in this implementation I represent letters through one-hot encoding, natural language processing applications represent words in the same fashion. The following figure illustrates the one-hot vectors used in this example and the vocabulary used.
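As a small illustration (the vocabulary and names here are hypothetical, not the ones shown in the figure), a letter can be converted to a one-hot vector like this:

#include <string.h>

/* Hypothetical vocabulary; the tutorial's actual vocabulary differs. */
static const char vocab[] = "abcdefgh";
#define VOCAB_SIZE ( (int)( sizeof( vocab ) - 1 ) )

/* Convert a letter into a one-hot vector: all zeros except a single 1.0
   at the letter's position in the vocabulary.                            */
static void one_hot_encode( char letter, double vec[] )
{
   const char *pos = strchr( vocab, letter );
   int i;

   for ( i = 0 ; i < VOCAB_SIZE ; i++ ) vec[ i ] = 0.0;
   if ( letter != '\0' && pos != NULL ) vec[ pos - vocab ] = 1.0;
}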
So, now I have an encoding that allows my RNN to work with letters. Now, let's look at the RNN. The following figure illustrates the Elman-style RNN in the context of letter prediction (feeding the one-hot vector representing the letter b). For each letter in the test word, I encode it as a one-hot vector and then feed it as the input to the network. The network is then executed in a feed-forward fashion, and the output is parsed in a winner-takes-all fashion to determine the winning element that defines the one-hot vector (in this example, the letter a). In this implementation, only the last letter of the word is checked; the outputs for the other letters are neither validated nor counted toward the fitness.
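Winner-takes-all parsing simply picks the index of the largest output value; a small sketch (with a hypothetical outputs array) might look like this:

/* Winner-takes-all: return the index of the largest output value; the
   index identifies the set element of the predicted one-hot vector.    */
static int winner_takes_all( const double *outputs, int count )
{
   int i, best = 0;

   for ( i = 1 ; i < count ; i++ )
   {
      if ( outputs[ i ] > outputs[ best ] ) best = i;
   }

   return best;
}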
Simple Elman-style RNN implementation
Let's look at a sample implementation of an Elman-style RNN trained through a genetic algorithm. You can find the Linux source code for this implementation at GitHub. The implementation is made up of three files: main.c, which provides the main loop and a function to test and derive the fitness of a population; ga.c, which implements the genetic algorithm functions; and rnn.c, which implements the actual RNN.
I focus on two core functions: the genetic algorithm process and the RNN evaluation function.
The meat of the RNN is found in the rnn_feed_forward function, which implements the execution of the RNN network (see the following code). This function is split into three stages and mirrors the network shown in the previous image. In the first stage, I calculate the outputs of the hidden layer, which incorporates the input layer and the context layers (each with its own set of weights). The context nodes are initialized to zero before testing a given word. In the second stage, I calculate the outputs of the output layer. This step incorporates each hidden-layer neuron with its own distinct weights. Finally, in the third stage, I propagate the first context layer to the second context layer and the hidden-layer output to the first context layer. This step implements the two layers of memory within the network.
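The repository's listing isn't reproduced here; the following is a minimal sketch of those three stages with hypothetical sizes and array names (biases are omitted for brevity, and the exact wiring of the two context layers may differ from the actual rnn.c):

#include <math.h>

/* Hypothetical sizes; the repository's constants differ. */
#define INPUTS  8     /* size of the one-hot input vector  */
#define HIDDEN  10    /* hidden-layer neurons              */
#define OUTPUTS 8     /* size of the one-hot output vector */

double inputs[ INPUTS ];
double hidden[ HIDDEN ];
double outputs[ OUTPUTS ];
double context1[ HIDDEN ];   /* memory of the previous hidden-layer result    */
double context2[ HIDDEN ];   /* memory of the hidden-layer result before that */

double w_input[ HIDDEN ][ INPUTS ];      /* input-layer weights           */
double w_context1[ HIDDEN ][ HIDDEN ];   /* first context-layer weights   */
double w_context2[ HIDDEN ][ HIDDEN ];   /* second context-layer weights  */
double w_output[ OUTPUTS ][ HIDDEN ];    /* output-layer weights          */

void rnn_feed_forward( void )
{
   int i, j;

   /* Stage 1: hidden layer from the inputs plus both context layers. */
   for ( i = 0 ; i < HIDDEN ; i++ )
   {
      double sum = 0.0;
      for ( j = 0 ; j < INPUTS ; j++ ) sum += w_input[ i ][ j ] * inputs[ j ];
      for ( j = 0 ; j < HIDDEN ; j++ ) sum += w_context1[ i ][ j ] * context1[ j ];
      for ( j = 0 ; j < HIDDEN ; j++ ) sum += w_context2[ i ][ j ] * context2[ j ];
      hidden[ i ] = tanh( sum );
   }

   /* Stage 2: output layer from the hidden layer. */
   for ( i = 0 ; i < OUTPUTS ; i++ )
   {
      double sum = 0.0;
      for ( j = 0 ; j < HIDDEN ; j++ ) sum += w_output[ i ][ j ] * hidden[ j ];
      outputs[ i ] = 1.0 / ( 1.0 + exp( -sum ) );    /* sigmoid */
   }

   /* Stage 3: shift the memory: first context layer to the second, then
      the new hidden-layer result into the first context layer.           */
   for ( i = 0 ; i < HIDDEN ; i++ )
   {
      context2[ i ] = context1[ i ];
      context1[ i ] = hidden[ i ];
   }
}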
In the hidden layer, I use the tanh function as my activation function, and the sigmoid function as the activation function in the output layer. The tanh function is useful in the hidden layer because it has the range -1 to 1 (allowing both positive and negative outputs from the hidden layer). In the output layer, where I'm interested in the largest value to activate the one-hot vector, I use the sigmoid function because its range is 0 to 1.
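The sketch above inlines these activations; written as small helpers (the names are mine), they would be:

#include <math.h>

/* Hidden-layer activation: hyperbolic tangent, range -1 to 1. */
static double activation_tanh( double x )
{
   return tanh( x );
}

/* Output-layer activation: logistic sigmoid, range 0 to 1. */
static double activation_sigmoid( double x )
{
   return 1.0 / ( 1.0 + exp( -x ) );
}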