Q-learning Algorithm Learning-1

Source: Internet
Author: User

Learn from the website below.

http://mnemstudio.org/path-finding-q-learning-tutorial.htm

This tutorial introduces the concept of Q-learning through a simple but comprehensive numerical example. The example describes an agent which uses unsupervised training to learn about an unknown environment. You might also find it helpful to compare this example with the accompanying source code examples.

Suppose we have 5 rooms in a building connected by doors as shown in the figure below.  We'll number each room 0 through 4. The outside of the building can be thought of as one big room (number 5). Notice that doors 1 and 4 lead into the building from room 5 (the outside).

We can represent the rooms on a graph, with each room as a node and each door as a link.

For this example, we'd like to put an agent in any room, and from that room, have it go outside the building (this will be our target room). In other words, the goal room is number 5. To set this room as a goal, we'll associate a reward value with each door (i.e. link between nodes). The doors that lead immediately to the goal have an instant reward of 100. Other doors not directly connected to the target room have zero reward. Because doors are two-way (0 leads to 4, and 4 leads back to 0), two arrows are assigned to each door. Each arrow carries an instant reward value, as shown below:

Of course, room 5 loops back to itself, and this connection, like every other direct connection to the goal room, carries a reward of 100. In Q-learning, the goal is to reach the state with the highest reward, so that if the agent arrives at the goal, it will remain there forever. This type of goal is called an "absorbing goal".

Imagine our agent as a dumb virtual robot that can learn through experience. The agent can pass from one room to another, but it has no knowledge of the environment and doesn't know which sequence of doors leads to the outside.

Suppose we want to model some kind of simple evacuation of an agent from any room in the building. Now suppose we have an agent in room 2 and we want it to learn to reach the outside of the building (5).

The terminology in Q-learning includes the terms "state" and "action".

We'll call each room, including the outside, a "state", and the agent's movement from one room to another an "action". In our diagram, a "state" is depicted as a node, while an "action" is represented by an arrow.

Suppose the agent is in state 2.  From state 2, it can go to state 3 because state 2 is connected to 3. From state 2, however, the agent cannot go directly to state 1 because there is no door connecting rooms 1 and 2 (thus, no arrows).  From state 3, it can go either to state 1 or 4, or back to 2 (look at all the arrows out of state 3).  If the agent is in state 4, then the three possible actions are to go to state 0, 5 or 3.  If the agent is in state 1, it can go either to state 5 or 3. From state 0, it can only go back to state 4.

We can put the state diagram and the instant reward values into the following reward table, "Matrix R".

The -1's in the table represent null values (i.e. where there isn't a link between nodes). For example, state 0 cannot go to state 1.
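
As a concrete reference, the reward table can be written out as a 6 x 6 array, with rows as the current state and columns as the action (next state). The values below are reconstructed from the room and door description above; the Python/NumPy form is only a sketch, not the tutorial's own source code.

    import numpy as np

    # Reward matrix R, reconstructed from the room/door description:
    #   -1  = no door between the two states (null value)
    #    0  = a door exists but does not lead to the goal
    #   100 = a door that leads directly to the goal state 5
    R = np.array([
        #  0    1    2    3    4    5
        [ -1,  -1,  -1,  -1,   0,  -1],   # state 0: door to 4
        [ -1,  -1,  -1,   0,  -1, 100],   # state 1: doors to 3 and 5
        [ -1,  -1,  -1,   0,  -1,  -1],   # state 2: door to 3
        [ -1,   0,   0,  -1,   0,  -1],   # state 3: doors to 1, 2 and 4
        [  0,  -1,  -1,   0,  -1, 100],   # state 4: doors to 0, 3 and 5
        [ -1,   0,  -1,  -1,   0, 100],   # state 5: doors to 1, 4 and itself
    ])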

Now we'll add a similar matrix, "Q", to the brain of our agent, representing the memory of what the agent has learned through experience. The rows of matrix Q represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes).

The agent starts out knowing nothing, and matrix Q is initialized to zero.  In this example, for simplicity of explanation, we assume the number of states is known (to be six).  If we didn't know how many states were involved, matrix Q could start off with only one element. It is a simple task to add more columns and rows to matrix Q as new states are found.
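
In code, this initialization is a one-liner; a minimal sketch, assuming NumPy and the six known states:

    import numpy as np

    NUM_STATES = 6                            # assumed known for this example
    Q = np.zeros((NUM_STATES, NUM_STATES))    # the agent starts out knowing nothing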

The transition rule of Q-learning is a very simple formula:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

According to this formula, the value assigned to a specific element of matrix Q is equal to the sum of the corresponding value in matrix R and the learning parameter Gamma multiplied by the maximum value of Q over all possible actions in the next state.
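
As a sketch, the transition rule can be expressed as a small helper function; the names q_update and next_state below are illustrative, not taken from the tutorial's source code. Note that in this room example, taking an action means moving to that state.

    def q_update(Q, R, state, action, gamma):
        """Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]."""
        next_state = action                       # here, the chosen action is the next state
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()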

Our virtual agent will learn through experience, without a teacher (this is called unsupervised learning). The agent will explore from state to state until it reaches the goal.  We'll call each exploration an episode.  Each episode consists of the agent moving from the initial state to the goal state. Each time the agent arrives at the goal state, the program goes on to the next episode.

The Q-learning algorithm goes as follows (a code sketch of this loop appears after the listing):

1. Set the gamma parameter, and environment rewards in Matrix R.

2. Initialize matrix Q to zero.

3. For each episode:

Select a random initial state.

Do while the goal state hasn't been reached:

  • Select one among all possible actions for the current state.
  • Using this possible action, consider going to the next state.
  • Get the maximum Q value for this next state based on all possible actions.
  • Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
  • Set the next state as the current state.

End do

End for
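
Putting the steps above together, here is one possible Python sketch of the training loop. It assumes the NumPy reward matrix R from the earlier sketch; the function and variable names are illustrative, not the tutorial's.

    import numpy as np

    def available_actions(R, state):
        """Possible actions are the columns of R that are not -1 (i.e. a door exists)."""
        return np.where(R[state] >= 0)[0]

    def train(R, gamma=0.8, episodes=1000, goal=5, seed=None):
        """Q-learning training loop following the algorithm listed above."""
        rng = np.random.default_rng(seed)
        n = R.shape[0]
        Q = np.zeros((n, n))                      # step 2: initialize matrix Q to zero
        for _ in range(episodes):                 # step 3: for each episode
            state = rng.integers(n)               # select a random initial state
            while state != goal:                  # do while the goal state hasn't been reached
                action = rng.choice(available_actions(R, state))  # select one possible action
                next_state = action               # taking the action moves us to that state
                # Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]
                Q[state, action] = R[state, action] + gamma * Q[next_state].max()
                state = next_state                # set the next state as the current state
        return Q

With R defined as in the earlier sketch, Q = train(R) fills the matrix; more episodes yield a better-trained Q.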

The algorithm above is used by the agent to learn from experience.  Each episode is equivalent to one training session. In each training session, the agent explores the environment (represented by matrix R) and receives rewards (if any) until it reaches the goal state.  The purpose of the training is to enhance the 'brain' of our agent, represented by matrix Q.  More training results in a more optimized matrix Q. Once matrix Q has been enhanced, instead of exploring around and going back and forth to the same rooms, the agent will find the fastest route to the goal state.

The Gamma parameter has a range of 0 to 1 (0 <= Gamma < 1).  If Gamma is closer to zero, the agent will tend to consider only immediate rewards. If Gamma is closer to one, the agent will consider future rewards with greater weight, willing to delay the reward.

To use the matrix Q, the agent simply traces the sequence of states from the initial state to the goal state. The algorithm finds the action with the highest reward value recorded in matrix Q for the current state:

Algorithm to utilize the Q matrix:

1. Set current state = initial state.

2. From the current state, find the action with the highest Q value.

3. Set current state = next state.

4. Repeat steps 2 and 3 until current state = goal state.

The algorithm above will return the sequence of states from the initial state to the goal state.
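
A minimal Python sketch of this procedure, assuming a trained matrix Q such as the one returned by the training sketch above (the function name greedy_path is mine):

    def greedy_path(Q, start, goal, max_steps=20):
        """Follow the action with the highest Q value from each state until the goal is reached."""
        state, path = start, [start]
        while state != goal and len(path) <= max_steps:
            state = int(Q[state].argmax())        # step 2: action with the highest Q value
            path.append(state)                    # step 3: the next state becomes the current state
        return path

    # Example (with a trained Q): greedy_path(Q, start=2, goal=5) -> e.g. [2, 3, 1, 5] or [2, 3, 4, 5]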

Q-learning Example by Hand

To understand how the Q-learning algorithm works, we'll go through a few episodes step by step. The rest of the steps are illustrated in the source code examples.

We'll start by setting the value of the learning parameter Gamma = 0.8, and the initial state as room 1.

Initialize matrix Q as a zero matrix.

Look at the second row (state 1) of matrix R. There are two possible actions for the current state 1: go to state 3, or go to state 5. By random selection, we select to go to 5 as our action.

Now let's imagine what would happen if our agent were in state 5.  Look at the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4) and Q(5, 5) are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5).

The next state, 5, now becomes the current state.  Because 5 is the goal state, we've finished one episode. Our agent's brain now contains an updated matrix Q, with Q(1, 5) = 100 and every other entry still zero.

For the next episode, we start with a randomly chosen initial state. This time, we have state 3 as our initial state.

Look at the fourth row of matrix R; it has 3 possible actions: go to state 1, 2 or 4. By random selection, we select to go to state 1 as our action.

Now imagine that we are in state 1.  Look at the second row of the reward matrix R (i.e. state 1).  It has 2 possible actions: go to state 3 or state 5. Then, we compute the Q value:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80

We use the updated matrix Q from the last episode: Q(1, 3) = 0 and Q(1, 5) = 100.  The result of the computation is Q(3, 1) = 80, because the instant reward R(3, 1) is zero. Matrix Q now contains two non-zero entries: Q(1, 5) = 100 and Q(3, 1) = 80.

The next state, 1, now becomes the current state. We repeat the inner loop of the Q-learning algorithm because state 1 is not the goal state.

So, starting the new loop with current state 1, there are two possible actions: go to state 3, or go to state 5. By lucky draw, the action we select is 5.

Now, imagining we're in state 5, there are three possible actions: go to state 1, 4 or 5. We compute the Q value using the maximum value of these possible actions.

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

The entries of matrix Q for the next state, Q(5, 1), Q(5, 4) and Q(5, 5), are all zero.  The result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5). This result does not change the Q matrix.

Because 5 is the goal state, we finish this episode. Our agent's brain now contains an updated matrix Q with Q(1, 5) = 100 and Q(3, 1) = 80.
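
The two hand-worked episodes can be replayed in a few lines of Python (a sketch reusing the reconstructed R matrix from earlier; the fixed action choices mirror the random "lucky draws" described above):

    import numpy as np

    R = np.array([
        [ -1, -1, -1, -1,  0,  -1],
        [ -1, -1, -1,  0, -1, 100],
        [ -1, -1, -1,  0, -1,  -1],
        [ -1,  0,  0, -1,  0,  -1],
        [  0, -1, -1,  0, -1, 100],
        [ -1,  0, -1, -1,  0, 100],
    ])
    gamma = 0.8
    Q = np.zeros((6, 6))

    # Episode 1: start in state 1, pick action 5 -> Q(1, 5) = 100 + 0.8 * 0 = 100
    Q[1, 5] = R[1, 5] + gamma * Q[5].max()

    # Episode 2: start in state 3, pick action 1 -> Q(3, 1) = 0 + 0.8 * max(Q(1, 3), Q(1, 5)) = 80
    Q[3, 1] = R[3, 1] + gamma * Q[1].max()

    # Episode 2 continued: from state 1, pick action 5 -> Q(1, 5) stays 100
    Q[1, 5] = R[1, 5] + gamma * Q[5].max()

    print(Q[1, 5], Q[3, 1])    # 100.0 80.0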

If our agent learns more through further episodes, the values in matrix Q will finally reach convergence.

This matrix Q can then be normalized (i.e. converted to percentages) by dividing all non-zero entries by the highest number in the matrix.
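
For example, with NumPy the normalization might look like this (a sketch; Q is a trained matrix such as the one from the training sketch above):

    import numpy as np

    def normalize(Q):
        """Express every entry of Q as a percentage of the largest entry."""
        return np.round(Q / Q.max() * 100)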

Once matrix Q gets close enough to convergence, we know our agent has learned the most optimal paths to the goal state. Tracing the best sequence of states is as simple as following the links with the highest values at each state.

For example, from initial state 2, the agent can use the matrix Q as a guide:

From state 2, the maximum Q value suggests the action to go to state 3.

From state 3, the maximum Q values suggest two alternatives: go to state 1 or 4. Suppose we arbitrarily choose to go to 1.

From state 1, the maximum Q value suggests the action to go to state 5.

Thus the sequence is 2-3-1-5.
