TensorFlow----Reinforcement Learning

Source: Internet
Author: User

Original: http://blog.sina.com.cn/s/blog_9409e4a3010137gm.html

Modelling the environment

Suppose a building has 5 rooms connected by doors, as shown in the figure below, and name the rooms A to E in turn. The space outside the building can be thought of as one big room, F, which surrounds the rest of the building. Note that the outside area F can be entered directly from room B or room E.


We can represent each room as a node, each door as an edge.


We want the agent to reach a target room: if an agent is placed in any room, we want it to get out of the building, so the target room is F. To express this goal we attach an instantaneous reward value to each door (shown as edge labels in the figure). A door that leads directly to the target gets a reward of 100 (the red arrows in the figure below); every other door gets a reward of 0. Because every door is two-way (A leads to E, and E leads back to A), each pair of connected nodes carries two opposite arrows, each with its own instantaneous reward. The diagram thus becomes a state diagram, as shown in the following figure.

Note that node F also has a self-loop (from F back to F) with the maximum reward of 100, so once an agent arrives at the target node it will stay there forever. We call such a node an absorbing goal: when the agent reaches the target state, it remains in that state.


Introduction to the agent, states, and actions

Suppose our agent is a robot that can learn by walking around. The robot can move from one room to another, but it knows nothing about the environment and does not know which sequence of doors leads out of the building.

Suppose we want to model a simple evacuation of the agent from any room in the building. Now assume the robot is currently in room C, and we want the robot to learn to reach F.


How can the robot learn by walking around?

Before we talk about how the robot learns (Q-learning), let us look at the terms state and action.

We treat each room (including the target, F) as a state, and the robot's move from one room to another as an action. Recall the state diagram above: a state is depicted by a node of the graph, and an action is represented by an arrow.


Suppose the robot is now in state C. From state C it can only go to state D. From state D it can go to state B, state E, or back to state C. If the robot is in state E, it has three possible actions: go to state A, state F, or state D. If the robot is in state B, it can reach state F or state D.
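To make this structure concrete, here is a minimal sketch (plain Python, not part of the original article; the variable name is ours) that encodes which states can be reached from each state through a door:

    # The state graph as a dictionary: each state maps to the states it can
    # reach through a door (see the figure above).
    actions = {
        "A": ["E"],
        "B": ["D", "F"],
        "C": ["D"],
        "D": ["B", "C", "E"],
        "E": ["A", "D", "F"],
        "F": ["B", "E", "F"],  # F also loops back to itself (the absorbing goal)
    }

    print(actions["C"])  # from C the only possible action is to move to D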

We now put the state diagram and the instantaneous reward values into the reward table below, the matrix R.

Reward table R (rows: the state the agent is in now; columns: the state the action goes to)

        A     B     C     D     E     F
  A     -     -     -     -     0     -
  B     -     -     -     0     -   100
  C     -     -     -     0     -     -
  D     -     0     0     -     0     -
  E     0     -     -     0     -   100
  F     -     0     -     -     0   100

A minus sign in the table means there is no door between the two states, so that transition is not possible.
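The same reward table can be written as a matrix in code. The following sketch uses NumPy (our choice, not part of the original article) and follows the common convention of encoding "no door" as -1, with rows and columns ordered A, B, C, D, E, F:

    import numpy as np

    # Reward matrix R: rows = current state, columns = next state (A..F).
    # -1 marks pairs of states with no door between them.
    R = np.array([
        [-1, -1, -1, -1,  0,  -1],   # A
        [-1, -1, -1,  0, -1, 100],   # B
        [-1, -1, -1,  0, -1,  -1],   # C
        [-1,  0,  0, -1,  0,  -1],   # D
        [ 0, -1, -1,  0, -1, 100],   # E
        [-1,  0, -1, -1,  0, 100],   # F
    ], dtype=float)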

Q-learning

In the previous sections we set up an environment model and a reward system. In this section we explain the learning algorithm, Q-learning (a simple form of reinforcement learning).

We have expressed the environment's reward system as the matrix R.


Now we put a similar matrix, Q, into the brain of the robot; Q stores the information about the environment that the robot acquires by walking around. The row header of matrix Q represents the robot's current state, and the column header represents the next state that an action leads to.

At first the robot knows nothing about the environment, so Q is a zero matrix. In this example, to simplify the explanation, we assume the number of states is known (there are 6 states). In general, Q can be initialized as an empty matrix (Q = []), with rows and columns added whenever a new state is discovered.

The transition rule of Q-learning is the following formula:

Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]

That is, the Q value for the current state and the chosen action equals the instantaneous reward R for that transition, plus γ times the maximum Q value over all actions available from the resulting next state. Here γ is the learning parameter (0 <= γ < 1).
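As a sketch of how this formula looks in code (assuming the NumPy matrix R from the earlier sketch, states and actions indexed 0..5 for A..F, and a hypothetical helper name of our own):

    import numpy as np

    gamma = 0.8              # learning parameter, 0 <= gamma < 1
    Q = np.zeros((6, 6))     # the robot's "brain", all zeros at first

    def q_update(R, Q, state, action, gamma):
        """One application of
        Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)].
        In this example an action means "move to that room", so the next state
        has the same index as the action."""
        next_state = action
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()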

Q-learning algorithm

The robot learns without supervision (unsupervised learning): it searches from state to state until it reaches the target node. We call each search from a starting state until the goal is reached an episode; when one episode ends, the program moves on to the next episode. (A proof that the algorithm converges can be found in the references.)

Q-learning Summary

Given: a state diagram that contains a goal state (represented by the matrix R).

Find: the minimum path from any initial state to the goal state (represented by the matrix Q).

The Q-learning algorithm goes as follows:

    Set the parameter γ and the environment reward matrix R
    Initialize matrix Q as the zero matrix
    For each episode:
        Select a random initial state
        Do while the goal state has not been reached
            Select one among all possible actions for the current state
            Using this possible action, consider going to the next state
            Get the maximum Q value of this next state, based on all possible actions
            Compute Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
            Set the next state as the current state
        End do
    End for
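The sketch below is one possible implementation of this loop in Python with NumPy (our reconstruction, not code from the original article; the episode count and random seed are arbitrary choices):

    import numpy as np

    # Reward matrix from the table above (repeated so the sketch is self-contained).
    R = np.array([
        [-1, -1, -1, -1,  0,  -1],
        [-1, -1, -1,  0, -1, 100],
        [-1, -1, -1,  0, -1,  -1],
        [-1,  0,  0, -1,  0,  -1],
        [ 0, -1, -1,  0, -1, 100],
        [-1,  0, -1, -1,  0, 100],
    ], dtype=float)

    gamma = 0.8                     # learning parameter
    goal = 5                        # state F
    n_states = R.shape[0]
    Q = np.zeros_like(R)            # initialize Q as the zero matrix
    rng = np.random.default_rng(0)

    for episode in range(1000):                    # each episode = one training session
        state = int(rng.integers(n_states))        # select a random initial state
        while state != goal:                       # do while the goal is not reached
            possible = np.where(R[state] >= 0)[0]  # all possible actions from this state
            action = int(rng.choice(possible))     # select one of them
            next_state = action                    # the action moves the robot there
            # Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)]
            Q[state, action] = R[state, action] + gamma * Q[next_state].max()
            state = next_state                     # set the next state as the current state

    print((Q / Q.max() * 100).round())             # the learned Q matrix, normalised to 100

Choosing actions at random keeps the sketch close to the pseudocode above; in practice an epsilon-greedy rule, which mostly follows the current Q values, is a common refinement.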

The above algorithm is used to train the robot. Each episode is equivalent to one training session. In each session the robot explores the environment (represented by the matrix R) and collects the rewards, if any, until it reaches the goal state. The purpose of training is to enrich the robot's brain (represented by the matrix Q): more training produces a better Q matrix that can guide the robot along an optimal path. Once the Q matrix is good enough, the robot will no longer explore aimlessly or walk back and forth between the same rooms; it will take the fastest route to the goal state.

γ lies between 0 and 1 (0 <= γ < 1). When γ is close to 0, the robot tends to consider only the instantaneous reward; when γ is close to 1, the robot gives more weight to future rewards, that is, to deferred reward.

Once trained, the robot can use the Q matrix to trace the sequence of states from its initial state to the target state. The algorithm simply selects, at each step, the action that yields the maximum Q value from the current state.
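As a sketch, this path-following step can be written as a small function (a hypothetical helper of ours, assuming the trained Q matrix from the sketch above and the same 0..5 indexing for A..F):

    def best_path(Q, start, goal):
        """Follow, from each state, the action with the maximum Q value.
        Assumes Q has already been trained; otherwise this may loop forever."""
        path = [start]
        state = start
        while state != goal:
            state = int(Q[state].argmax())   # pick the action with the largest Q value
            path.append(state)
        return path

    # Starting from room C (index 2) with F (index 5) as the goal:
    print(best_path(Q, start=2, goal=5))
    # C -> D -> B -> F and C -> D -> E -> F are both optimal; argmax returns one of them.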

Specific steps:

We assume γ = 0.8 and the initial state is B, and we first set Q to the zero matrix.
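For example (a reconstruction of the usual walkthrough for this example, since the original figures are not reproduced here): from B the robot can go to D or to F. Suppose it happens to pick F. Then, since all entries of Q are still zero,

Q(B, F) = R(B, F) + 0.8 · max[Q(F, B), Q(F, E), Q(F, F)] = 100 + 0.8 · 0 = 100,

so the robot's brain now records the value 100 for the transition from B to F. Because F is the goal state, this episode ends and the next one begins from a new random initial state.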


