TensorFlow----Reinforcement Learning

Source: Internet
Author: User

Original: http://blog.sina.com.cn/s/blog_9409e4a3010137gm.html

Modelling the environment

Suppose a building has 5 rooms connected by doors, as shown in the figure below, and name the rooms A to E in turn. The space outside the building can be thought of as one big room, F, which surrounds the rest of the building. Note that the outside area F can be entered directly from room B or room E.


We can represent each room as a node, each door as an edge.


We want the agent to reach a target room: if an agent is placed in any room, we want it to get out of the building, so the target room is F. To express this goal we attach an instantaneous reward value to each door (shown as edge labels in the figure). A door that leads directly to the target gets a reward of 100 (the red arrows in the figure below); every other door gets a reward of 0. Because every door is two-way (A leads to E, and E leads back to A), each pair of connected nodes carries two opposite arrows, each with its own instantaneous reward. The diagram thus becomes a state diagram, as shown in the following figure.

Note that node F also has a self-loop (from F back to F) with the maximum reward of 100, so once an agent arrives at the target node it will stay there forever. We call such a node an absorbing goal: when the agent reaches the target state, it remains in that state.


Introduction to the agent, states, and actions

Suppose our agent is a robot that can learn by walking around. The robot can move from one room to another, but it knows nothing about the environment and does not know which sequence of doors leads out of the building.

Suppose we want to model a simple evacuation of the agent from any room in the building. Now assume the robot is currently in room C, and we want the robot to learn to reach F.


How can the robot learn by walking around?

Before we talk about how the robot learns (Q-learning), let us look at the terms state and action.

We treat each room (including the target, F) as a state, and the robot's move from one room to another as an action. Recall the state diagram above: a state is depicted by a node of the graph, and an action is represented by an arrow.


Suppose the robot is now in state C. From state C it can only go to state D. From state D it can go to state B, state E, or back to state C. If the robot is in state E, it has three possible actions: go to state A, state F, or state D. If the robot is in state B, it can reach state F or state D.
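To make this structure concrete, here is a minimal sketch (plain Python, not part of the original article; the variable name is ours) that encodes which states can be reached from each state through a door:

    # The state graph as a dictionary: each state maps to the states it can
    # reach through a door (see the figure above).
    actions = {
        "A": ["E"],
        "B": ["D", "F"],
        "C": ["D"],
        "D": ["B", "C", "E"],
        "E": ["A", "D", "F"],
        "F": ["B", "E", "F"],  # F also loops back to itself (the absorbing goal)
    }

    print(actions["C"])  # from C the only possible action is to move to D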

We now put the state diagram and the instantaneous reward values into the reward table below, the matrix R.

Reward table R (rows: the state the agent is in now; columns: the state the action goes to)

        A     B     C     D     E     F
  A     -     -     -     -     0     -
  B     -     -     -     0     -   100
  C     -     -     -     0     -     -
  D     -     0     0     -     0     -
  E     0     -     -     0     -   100
  F     -     0     -     -     0   100

A minus sign in the table means there is no door between the two states, so that transition is not possible.
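The same reward table can be written as a matrix in code. The following sketch uses NumPy (our choice, not part of the original article) and follows the common convention of encoding "no door" as -1, with rows and columns ordered A, B, C, D, E, F:

    import numpy as np

    # Reward matrix R: rows = current state, columns = next state (A..F).
    # -1 marks pairs of states with no door between them.
    R = np.array([
        [-1, -1, -1, -1,  0,  -1],   # A
        [-1, -1, -1,  0, -1, 100],   # B
        [-1, -1, -1,  0, -1,  -1],   # C
        [-1,  0,  0, -1,  0,  -1],   # D
        [ 0, -1, -1,  0, -1, 100],   # E
        [-1,  0, -1, -1,  0, 100],   # F
    ], dtype=float)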

Q-learning

In the previous sections we set up an environment model and a reward system. In this section we explain the learning algorithm, Q-learning (a simple form of reinforcement learning).

We have expressed the environment's reward system as the matrix R.


Now we put a similar matrix, Q, into the brain of the robot; Q stores the information about the environment that the robot acquires by walking around. The row header of matrix Q represents the robot's current state, and the column header represents the next state that an action leads to.

At first the robot knows nothing about the environment, so Q is a zero matrix. In this example, to simplify the explanation, we assume the number of states is known (there are 6 states). In general, Q can be initialized as an empty matrix (Q = []), with rows and columns added whenever a new state is discovered.

The transition rule of Q-learning is the following formula:

Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]

That is, the Q value for the current state and the chosen action equals the instantaneous reward R for that transition, plus γ times the maximum Q value over all actions available from the resulting next state. Here γ is the learning parameter (0 <= γ < 1).
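As a sketch of how this formula looks in code (assuming the NumPy matrix R from the earlier sketch, states and actions indexed 0..5 for A..F, and a hypothetical helper name of our own):

    import numpy as np

    gamma = 0.8              # learning parameter, 0 <= gamma < 1
    Q = np.zeros((6, 6))     # the robot's "brain", all zeros at first

    def q_update(R, Q, state, action, gamma):
        """One application of
        Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)].
        In this example an action means "move to that room", so the next state
        has the same index as the action."""
        next_state = action
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()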

Q-learning algorithm

The robot learns without supervision (unsupervised learning): it searches from state to state until it reaches the target node. We call each search from a starting state until the goal is reached an episode; when one episode ends, the program moves on to the next episode. (A proof that the algorithm converges can be found in the references.)

Q-learning Summary

Given: a state diagram that contains a goal state (represented by the matrix R).

Find: the minimum path from any initial state to the goal state (represented by the matrix Q).

The Q-learning algorithm goes as follows:

    Set the parameter γ and the environment reward matrix R
    Initialize matrix Q as the zero matrix
    For each episode:
        Select a random initial state
        Do while the goal state has not been reached
            Select one among all possible actions for the current state
            Using this possible action, consider going to the next state
            Get the maximum Q value of this next state, based on all possible actions
            Compute Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
            Set the next state as the current state
        End do
    End for
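The sketch below is one possible implementation of this loop in Python with NumPy (our reconstruction, not code from the original article; the episode count and random seed are arbitrary choices):

    import numpy as np

    # Reward matrix from the table above (repeated so the sketch is self-contained).
    R = np.array([
        [-1, -1, -1, -1,  0,  -1],
        [-1, -1, -1,  0, -1, 100],
        [-1, -1, -1,  0, -1,  -1],
        [-1,  0,  0, -1,  0,  -1],
        [ 0, -1, -1,  0, -1, 100],
        [-1,  0, -1, -1,  0, 100],
    ], dtype=float)

    gamma = 0.8                     # learning parameter
    goal = 5                        # state F
    n_states = R.shape[0]
    Q = np.zeros_like(R)            # initialize Q as the zero matrix
    rng = np.random.default_rng(0)

    for episode in range(1000):                    # each episode = one training session
        state = int(rng.integers(n_states))        # select a random initial state
        while state != goal:                       # do while the goal is not reached
            possible = np.where(R[state] >= 0)[0]  # all possible actions from this state
            action = int(rng.choice(possible))     # select one of them
            next_state = action                    # the action moves the robot there
            # Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)]
            Q[state, action] = R[state, action] + gamma * Q[next_state].max()
            state = next_state                     # set the next state as the current state

    print((Q / Q.max() * 100).round())             # the learned Q matrix, normalised to 100

Choosing actions at random keeps the sketch close to the pseudocode above; in practice an epsilon-greedy rule, which mostly follows the current Q values, is a common refinement.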

The above algorithm is used to train the robot. Each episode is equivalent to one training session. In each session the robot explores the environment (represented by the matrix R) and collects the rewards, if any, until it reaches the goal state. The purpose of training is to enrich the robot's brain (represented by the matrix Q): more training produces a better Q matrix that can guide the robot along an optimal path. Once the Q matrix is good enough, the robot will no longer explore aimlessly or walk back and forth between the same rooms; it will take the fastest route to the goal state.

γ lies between 0 and 1 (0 <= γ < 1). When γ is close to 0, the robot tends to consider only the instantaneous reward; when γ is close to 1, the robot gives more weight to future rewards, that is, to deferred reward.

Once trained, the robot can use the Q matrix to trace the sequence of states from its initial state to the target state. The algorithm simply selects, at each step, the action that yields the maximum Q value from the current state.
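As a sketch, this path-following step can be written as a small function (a hypothetical helper of ours, assuming the trained Q matrix from the sketch above and the same 0..5 indexing for A..F):

    def best_path(Q, start, goal):
        """Follow, from each state, the action with the maximum Q value.
        Assumes Q has already been trained; otherwise this may loop forever."""
        path = [start]
        state = start
        while state != goal:
            state = int(Q[state].argmax())   # pick the action with the largest Q value
            path.append(state)
        return path

    # Starting from room C (index 2) with F (index 5) as the goal:
    print(best_path(Q, start=2, goal=5))
    # C -> D -> B -> F and C -> D -> E -> F are both optimal; argmax returns one of them.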

Specific steps:

We assume γ = 0.8 and the initial state is B, and we first set Q to the zero matrix.
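For example (a reconstruction of the usual walkthrough for this example, since the original figures are not reproduced here): from B the robot can go to D or to F. Suppose it happens to pick F. Then, since all entries of Q are still zero,

Q(B, F) = R(B, F) + 0.8 · max[Q(F, B), Q(F, E), Q(F, F)] = 100 + 0.8 · 0 = 100,

so the robot's brain now records the value 100 for the transition from B to F. Because F is the goal state, this episode ends and the next one begins from a new random initial state.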


