Introduction to Reinforcement Learning Algorithms (Reinforcement Learning and Control)

In the previous discussions, we were always given a sample x, with or without a label y, and then fitted, classified, clustered, or dimensionality-reduced the samples. However, for many sequential decision or control problems it is difficult to obtain such regular samples. Consider the four-legged robot control problem: at first we do not know which leg it should move, and during the movement we also do not know how to make the robot automatically find the right direction.

In addition, if you want to design a chess AI, each move is really a decision-making process. Although heuristic methods such as A* exist for simple games, in complex situations the machine still needs to look several moves ahead to decide which move is better, so a better decision-making method is needed.

For this kind of control and decision problem there is a natural way of thinking: design a reward function. If the learning agent (such as the four-legged robot above, or a chess AI program) obtains a good result after a decision, we give the agent some reward (for example, the reward function returns a positive value); if it obtains a poor result, the reward is negative. For instance, if the quadruped robot moves one step forward (closer to the target), the reward is positive; if it moves backward, the reward is negative. If we can evaluate every step and obtain the corresponding reward, the rest is straightforward: we just need to find the path with the largest total reward (the sum of the per-step rewards), which we consider the best path.
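As a concrete illustration of this idea, here is a minimal sketch of such a reward function for the quadruped example, assuming the state reduces to the robot's distance to its target; the names and the +1/-1 values are hypothetical, not taken from the original text.

```python
def reward(prev_distance_to_target, new_distance_to_target):
    """Toy reward: +1 if this step brought the robot closer to the target, -1 otherwise."""
    return 1.0 if new_distance_to_target < prev_distance_to_target else -1.0
```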

Reinforcement learning has been successfully applied in many fields, such as autonomous helicopter flight, robot control, routing in mobile networks, market decision making, industrial control, and efficient web-page indexing.

Next, let us introduce the Markov decision process (MDP).

1. Markov decision process

A Markov decision process consists of a five-tuple (S, A, {P_sa}, γ, R), whose components are:

* S is the set of states. (For example, in an autonomous helicopter system, the helicopter's current position coordinates constitute a state.)

* A is the set of actions. (For example, the directions in which the joystick can steer the helicopter: forward, backward, and so on.)

* P_sa are the state transition probabilities. Moving from one state in S to another requires an action a to take part. P_sa is the probability distribution over the states we may end up in after performing action a in the current state s (after executing a, the current state may jump to many possible states).

* γ ∈ [0, 1) is the discount factor.

* R: S × A → ℝ is the reward function. The reward function is often written as a function of s alone (depending only on the state), in which case it becomes R: S → ℝ.

The dynamic process of an MDP is as follows: the agent starts in some initial state s_0, then picks an action a_0 from A to execute; after the execution the agent is randomly transferred to the next state s_1 according to the probability distribution P_{s_0 a_0}. It then performs an action a_1, moves on to s_2, then executes a_2, and so on. We can represent the whole process with the following chain:

s_0 --a_0--> s_1 --a_1--> s_2 --a_2--> s_3 --a_3--> ...
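To make the notation concrete, the following is a minimal sketch (in Python, with hypothetical names and numbers, not taken from the original text) of how such a five-tuple could be represented for a small finite MDP; the transition probabilities P[s][a] are stored as a distribution over next states.

```python
import random

# A tiny, hypothetical finite MDP: states, actions, transition probabilities,
# a discount factor, and a reward function that depends only on the state.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
gamma = 0.9

# P[s][a] maps each possible next state to its probability (each row sums to 1).
P = {
    "s0": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s1": 0.9, "s2": 0.1}},
    "s1": {"left": {"s0": 1.0},            "right": {"s2": 1.0}},
    "s2": {"left": {"s2": 1.0},            "right": {"s2": 1.0}},
}

R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}  # reward of each state

def step(s, a):
    """Sample the next state s' according to the distribution P[s][a]."""
    next_states = list(P[s][a])
    probs = [P[s][a][s_next] for s_next in next_states]
    return random.choices(next_states, weights=probs, k=1)[0]
```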

If you have a good understanding of HMMs (hidden Markov models), this dynamic process is easy to follow.

We define the total payoff along the above transition path as the discounted sum of the rewards collected:

R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + ⋯

If R depends only on s, the above can be written as

R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯

Our goal is to select the best set of actions so that the expected value of the total discounted payoff is maximized:

E[R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯]

From the above it can be seen that the reward received at time t is discounted by the factor γ^t, a gradually decaying process: the later a state occurs, the smaller its influence on the total payoff. To maximize the expected value, we therefore want the large rewards to come as early as possible and the small ones as late as possible.
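As a quick numerical check of this discounting effect, here is a small helper (hypothetical, not part of the original text) that computes the discounted sum of a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Receiving the large reward early gives a higher discounted sum than receiving it late:
print(discounted_return([1.0, 0.0, 0.0]))  # 1.0
print(discounted_return([0.0, 0.0, 1.0]))  # 0.81
```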

When the agent is in a certain state, it selects the next action a to execute according to some strategy and then transitions to another state s'. We call this action-selection rule a policy π, and every policy is in fact a mapping from states to actions, π: S → A. Given π, we know which action should be performed in every state.

To distinguish good policies from bad ones, we need to measure how good or bad the outcome is when, starting from the current state, we execute a policy π. For this we define the value function, also called the discounted cumulative reward:

V^π(s) = E[R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯ | s_0 = s, π]

As you can see, for the current state s and a chosen policy π, the value function is the expected discounted sum of rewards. This is actually easy to understand: given s and π, the future action plan is fixed; the plan will pass through a sequence of states, each visited state yields some reward, and the closer a state is to the current one, the larger its weight in evaluating the plan. This is similar to chess: for the current board position, we evaluate each candidate plan by judging the future positions s_1, s_2, … it leads to. In general we look a few moves ahead in our mind, but we pay the most attention to the very next position.
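Since V^π(s) is defined as an expectation, one way to make it tangible (a sketch under the assumptions of the toy MDP above, not something the original text does) is to estimate it by averaging the discounted returns of sampled rollouts under a fixed, hypothetical policy:

```python
pi = {"s0": "right", "s1": "right", "s2": "left"}  # a fixed, hypothetical policy

def estimate_value(s0, pi, n_rollouts=5000, horizon=50):
    """Monte Carlo estimate of V^pi(s0): average discounted return over sampled rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * R[s]
            s = step(s, pi[s])
            discount *= gamma
        total += ret
    return total / n_rollouts

print(estimate_value("s0", pi))
```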

From a recursive point of view, the value function V^π of the current state s can be regarded as the reward R(s) of the current state plus the (discounted) value function of the next state; that is, we can rewrite the formula as

V^π(s) = R(s) + γ V^π(s_1), where s_1 is the next state.

However, note that although π is given, so the action a = π(s) taken in a given state s is unique, the resulting next state may not be unique. For example, if choosing action a means throwing a die and then moving forward, there may be six possible next states. Applying the Bellman equation, we obtain from the formula above

V^π(s) = R(s) + γ Σ_{s' ∈ S} P_{sπ(s)}(s') V^π(s')

Here s' denotes the next state.

The first term R(s) is called the immediate reward, the reward of the current state. The second term can also be written as γ E_{s' ~ P_{sπ(s)}}[V^π(s')], i.e., the expectation of the value function of the next state, where the next state s' follows the distribution P_{sπ(s)}.

As you can imagine, when the number of states is finite, we can find V^π(s) for every s via the equation above (a terminal state has no second term). Listing these equations gives a system of linear equations: |S| equations in |S| unknowns, which can be solved directly.
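As a concrete (hypothetical) illustration of "listing |S| equations and solving directly", the sketch below evaluates the fixed policy pi on the toy MDP defined earlier by solving the linear system (I − γ P^π) V = R with NumPy; all names are assumptions carried over from the earlier sketches, not from the original text.

```python
import numpy as np

n = len(states)
idx = {s: i for i, s in enumerate(states)}

# Build the |S| x |S| transition matrix induced by the policy, and the reward vector.
P_pi = np.zeros((n, n))
for s in states:
    for s_next, p in P[s][pi[s]].items():
        P_pi[idx[s], idx[s_next]] = p
R_vec = np.array([R[s] for s in states])

# Bellman equations V = R + gamma * P_pi V  <=>  (I - gamma * P_pi) V = R
V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_vec)
print(dict(zip(states, V.round(3))))
```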

Of course, the purpose of computing V is to find, for the current state s, the optimal action strategy. The optimal value function V* is defined as follows:

V*(s) = max_π V^π(s)

That is, among all the policies we could choose, pick the one that maximizes the discounted rewards.

The Bellman-equation form of the above definition is as follows:

V*(s) = R(s) + max_{a ∈ A} γ Σ_{s' ∈ S} P_{sa}(s') V*(s')

The first term R(s) does not involve the action, so it is unchanged. The second term is, for each state s, a decision about the next action a: after performing a, it is the expectation of the next-state value function, weighted by the probability distribution P_sa over the next states s'.

Having defined the optimal V*, we also define the optimal policy π*: S → A as follows:

π*(s) = arg max_{a ∈ A} Σ_{s' ∈ S} P_{sa}(s') V*(s')
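To tie these definitions together, here is a minimal value-iteration sketch for the toy MDP above (value iteration is a standard way to compute V* for a finite MDP; it is used here purely as an assumed illustration and is not derived in this section), which then reads off the greedy policy π*:

```python
def value_iteration(states, actions, P, R, gamma, n_iters=1000, tol=1e-8):
    """Iterate V(s) <- R(s) + max_a gamma * sum_{s'} P[s][a][s'] * V(s') to convergence."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        new_V = {
            s: R[s] + max(
                gamma * sum(p * V[s_next] for s_next, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
        converged = max(abs(new_V[s] - V[s]) for s in states) < tol
        V = new_V
        if converged:
            break
    # Greedy policy with respect to the (approximate) V*
    pi_star = {
        s: max(actions, key=lambda a: sum(p * V[s_next] for s_next, p in P[s][a].items()))
        for s in states
    }
    return V, pi_star

V_star, pi_star = value_iteration(states, actions, P, R, gamma)
print(V_star, pi_star)
```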
