Machine Learning Algorithms Study Notes (5) - Reinforcement Learning

Reinforcement Learning

The approach to control and decision problems is to design a reward function. If the learning agent (for example, the four-legged robot mentioned earlier, or a chess-playing AI program) obtains a good result after taking an action, we give it a positive reward; if the result is poor, the reward is negative. For a quadruped robot, say, a step forward (closer to the target) yields a positive reward, while a step backward yields a negative one. If we can evaluate every step and obtain its reward, the rest is straightforward: we only need to find the path whose total reward (the sum of the rewards of each step) is largest, and that path is taken to be the best one.

Reinforcement learning has been applied successfully in many domains, such as autonomous helicopter flight, robot control, mobile-network routing, market decision making, industrial control, and efficient web indexing. The treatment of reinforcement learning in this section starts from the Markov decision process (MDP, Markov Decision Process).

Markov Decision Processes

A Markov decision process is defined by a five-tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where:

    • $S$ is the set of states. (For example, in an autonomous helicopter system, the helicopter's current position coordinates make up a state.)
    • $A$ is the set of actions. (For example, the directions in which the joystick can steer the helicopter: forward, backward, and so on.)
    • $P_{sa}$ are the state transition probabilities. Moving from one state in $S$ to another requires an action $a$: $P_{sa}$ is the distribution over the states we end up in if we execute action $a$ in the current state $s$ (after executing $a$, the current state may jump to many possible states).
    • $\gamma$ is the discount factor.
    • $R: S \times A \mapsto \mathbb{R}$ is the reward function. The reward is often written as a function of the state alone (depending only on $s$), in which case it is rewritten as $R: S \mapsto \mathbb{R}$.

The dynamics of an MDP proceed as follows: the agent starts in some initial state $s_0$, picks an action $a_0 \in A$ to execute, and after executing it is randomly transferred by the probabilities $P_{s_0 a_0}$ to the next state $s_1$. It then performs an action $a_1$, moves on to $s_2$, executes $a_2$, and so on. The whole process can be represented by the chain

$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$

We define the total payoff accumulated along such a transfer path as

$R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$

If $R$ depends only on $s$, this can be written as

$R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$

Our goal is to choose the actions over time so as to maximize the expected value of this discounted sum of rewards:

$\max \; \mathrm{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \right]$

From the formula above we can see that the reward received at time step $t$ is discounted by a factor of $\gamma^t$, a gradual decay: the further a state lies in the future, the smaller its influence on the total payoff. Maximizing the expected value therefore means arranging for the large rewards to arrive as early as possible and pushing the small ones as far back as possible.
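
To make the effect of discounting concrete, here is a minimal Python sketch; the reward sequences and the discount value are made up purely for illustration, not taken from the notes.

```python
# Discounted return: r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards of a finite path, weighting step t by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same rewards give a higher discounted return when the large one comes first.
print(discounted_return([10, 1, 1]))   # 10 + 0.9*1 + 0.81*1 = 11.71
print(discounted_return([1, 1, 10]))   # 1 + 0.9*1 + 0.81*10 = 10.0
```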

When the agent is in some state, it selects the next action $a$ to execute according to some rule and then transitions to another state $s'$. We call this action-selection rule a policy: each policy $\pi$ is a mapping from states to actions, $\pi: S \mapsto A$. Once $\pi$ is given, we know which action to perform in every state.

To tell good policies from bad ones, we need to quantify how well things turn out if we start in the current state and follow a given policy $\pi$. For this we define the value function, also called the discounted cumulative reward:

$V^{\pi}(s) = \mathrm{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi \right]$

As the definition shows, for the current state $s$ and a chosen policy $\pi$, the value function is the expected weighted sum of rewards. This is easy to understand: once $\pi$ is given, the future action plan is given; the plan passes through a sequence of states $s_1, s_2, \ldots$, each state reached yields some reward, and the closer a state is to the current one, the larger its weight in evaluating the plan. This is similar to chess: for the current board position, we evaluate each candidate move by judging the future positions it leads to. In general we look several moves ahead in our head, but we pay the most attention to the position that follows immediately.

Viewed recursively, the value function of the current state $s$, $V^{\pi}(s)$, can be regarded as the reward of the current state, $R(s)$, plus the discounted value function of the next state, $V^{\pi}(s')$; that is, the formula can be rewritten as

$V^{\pi}(s) = R(s) + \gamma V^{\pi}(s')$

Note, however, that although $\pi$ is given, so that the action $a = \pi(s)$ taken in state $s$ is unique, the next state $s'$ need not be unique: the mapping from $(s, a)$ to the next state can be one-to-many. For example, if the chosen action $a$ is to roll a die and move forward accordingly, there may be six possible next states. Applying the Bellman equation, the expression above becomes

$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^{\pi}(s') \qquad (1)$

Here $s'$ denotes the next state.

The first term $R(s)$ is called the immediate reward, the reward for the current state. The second term can also be written as $\gamma \, \mathrm{E}_{s' \sim P_{s\pi(s)}}\left[ V^{\pi}(s') \right]$, the expectation of the value function of the next state, where the next state $s'$ follows the distribution $P_{s\pi(s)}$.

As you can imagine, when the number of states is finite, we can use the equation above to solve for $V^{\pi}(s)$ for every $s$ (a terminal state has no second term, only $R(s)$). Writing out the equation for every state gives a system of $|S|$ linear equations in $|S|$ unknowns, which can be solved directly.
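
As a small illustration, here is a sketch of solving that linear system with numpy for a fixed policy; the three-state transition matrix $P_{\pi}$ (with $P_{\pi}[s, s'] = P_{s\pi(s)}(s')$), the rewards, and $\gamma$ are made-up example values, not taken from the notes.

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, -1.0])          # reward R(s) for three states
P_pi = np.array([[0.8, 0.2, 0.0],       # row s gives P_{s,pi(s)}(s'); rows sum to 1
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])

# Bellman equation (1) in matrix form: V = R + gamma * P_pi V,
# i.e. (I - gamma * P_pi) V = R, one linear equation per state.
V_pi = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V_pi)
```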

Of course, the reason we compute $V^{\pi}$ is to find the optimal action policy for the current state $s$. The optimal value function $V^{*}$ is defined as

$V^{*}(s) = \max_{\pi} V^{\pi}(s)$

That is, among all candidate policies we pick, for each state, the one whose discounted cumulative reward is largest.

The Bellman-equation form of this definition is:

$V^{*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{*}(s') \qquad (2)$

The first term, $R(s)$, does not depend on the action, so it stays unchanged outside the maximization. The second term is a decision: for each state $s$ we choose the action $a$ that maximizes the expected discounted value of the next state $s'$, where the expectation is taken over the distribution $P_{sa}$.

Having defined the optimal value function $V^{*}$, we define the optimal policy $\pi^{*}: S \mapsto A$ as follows:

$\pi^{*}(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^{*}(s') \qquad (3)$

By selecting the maximizing action $a$ for every state $s$, the optimal next action of each state is determined.

From the definitions above we have

$V^{*}(s) = V^{\pi^{*}}(s) \geq V^{\pi}(s) \quad \text{for every policy } \pi$

In other words, the optimal value function of the current state, $V^{*}(s)$, is obtained by following the optimal policy $\pi^{*}$, and the payoff of the optimal policy is at least as good as that of any other policy.

It is important to note that if we can find the optimal action $a$ for every state $s$, then viewed globally these choices form a mapping from states to actions, and this mapping is the optimal policy, written $\pi^{*}$. For the whole state space $S$, the next action of each state is fixed once and for all; it does not change depending on which initial state $s$ we happen to start from.

Value Iteration and Policy Iteration

This section discusses two effective algorithms for computing a concrete optimal policy for an MDP. Here we consider only MDPs with finite state and finite action spaces.

Value Iteration Method

  1. Initialize $V(s) := 0$ for every state $s$.
  2. Loop until convergence {

    For every state $s$, update $V(s) := R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V(s')$

    }

The value-iteration update uses the Bellman equation (2) from the previous section.

The inner loop can be implemented in two ways:

1) Synchronous iteration

In the first iteration after initialization, every $V(s)$ starts at 0, so the new value computed for each $s$ is $V(s) = R(s) + 0 = R(s)$. In the synchronous method, the new $V(s)$ computed for each state is saved rather than applied immediately; only after the new values of all states have been computed are they written back in a single, unified update.

Thus, after the first iteration, $V(s) = R(s)$.

2) Asynchronous iteration

In contrast with synchronous iteration, the asynchronous method does not buffer the new $V(s)$ for each state but writes it back immediately. As a result, after the first iteration most states already satisfy $V(s) > R(s)$.

Whichever variant is used, $V(s)$ eventually converges to $V^{*}(s)$. Once $V^{*}$ is known, we can use formula (3) to find the corresponding optimal policy; of course, the policy can also be extracted along the way while computing $V^{*}$.
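
The following is a minimal sketch of synchronous value iteration, assuming the model is given as numpy arrays with P[s, a, s'] the transition probabilities and R[s] the rewards; these names and the tiny two-state example are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) := R(s) + max_a gamma * sum_s' P[s,a,s'] V(s') until convergence."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                  # step 1: initialize every V(s) to 0
    while True:
        Q = R[:, None] + gamma * (P @ V)    # Q[s, a]: the bracketed term of equation (2)
        V_new = Q.max(axis=1)               # synchronous update: write back all states at once
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)               # greedy policy, as in formula (3)
    return V_new, policy

# Tiny two-state, two-action example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([0.0, 1.0])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

Replacing the buffered `V_new` with an in-place update of `V`, one state at a time, would give the asynchronous variant described above.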

Policy Iteration Method

Value iteration makes the value function $V$ converge to $V^{*}$, whereas policy iteration focuses on making the policy $\pi$ converge to $\pi^{*}$.

  1. Randomly initialize a policy $\pi$ (a mapping from $S$ to $A$).
  2. Loop until convergence {
    (a) Let $V := V^{\pi}$
    (b) For every state $s$, update $\pi(s) := \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s')$

    }

The $V$ in step (a) can be obtained from the Bellman equation (1) of the previous section,

$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^{\pi}(s')$

This step solves for $V^{\pi}(s)$ for all states $s$ under the current policy $\pi$.

Step (b) then picks, for each state $s$, the action $a$ that is optimal with respect to the $V$ obtained in step (a), and updates $\pi(s)$ accordingly.
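
Below is a comparable sketch of policy iteration under the same assumed array layout (P[s, a, s'], R[s]); again the layout and the toy example are illustrative assumptions. Step (a) solves the linear system for $V^{\pi}$, and step (b) is the greedy improvement.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, seed=0):
    n_states, n_actions, _ = P.shape
    rng = np.random.default_rng(seed)
    pi = rng.integers(n_actions, size=n_states)    # 1. random initial policy
    while True:
        # (a) policy evaluation: solve (I - gamma * P_pi) V = R for V = V^pi
        P_pi = P[np.arange(n_states), pi]          # P_pi[s, s'] = P[s, pi(s), s']
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # (b) policy improvement: pi(s) := argmax_a sum_s' P[s, a, s'] V(s')
        pi_new = (P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):             # converged: the policy is stable
            return V, pi
        pi = pi_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([0.0, 1.0])
V_star, pi_star = policy_iteration(P, R)
print(V_star, pi_star)
```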

It is hard to say in general whether value iteration or policy iteration is better. For a small MDP, policy iteration tends to converge very quickly. For an MDP with a large state space, however, solving the linear system for $V^{\pi}$ is expensive, and value iteration, which never has to solve a linear system, is often easier.

Learning a model for an MDP

In the MDPs discussed so far, the state transition probabilities $P_{sa}$ and the reward function $R(s)$ were assumed known. In many practical problems, however, these quantities cannot be obtained explicitly, and we need to estimate them from data (usually $S$, $A$ and $\gamma$ are known).

Suppose we have observed a number of state transition paths of the form

$s_0^{(j)} \xrightarrow{a_0^{(j)}} s_1^{(j)} \xrightarrow{a_1^{(j)}} s_2^{(j)} \xrightarrow{a_2^{(j)}} s_3^{(j)} \xrightarrow{a_3^{(j)}} \cdots$

Here $s_i^{(j)}$ is the state at time step $i$ of the $j$-th transfer path, and $a_i^{(j)}$ is the action performed in that state. The number of states in each path is finite; in practice, each transfer chain either reaches a terminal state or is stopped after a specified number of steps.

If we have obtained many such transfer chains (which serve as our samples), we can use maximum likelihood estimation to estimate the state transition probabilities:

$P_{sa}(s') = \dfrac{\#\{\text{times action } a \text{ was taken in state } s \text{ and led to } s'\}}{\#\{\text{times action } a \text{ was taken in state } s\}}$

The numerator is the number of times that executing action $a$ from state $s$ led to state $s'$, and the denominator is the number of times action $a$ was executed in state $s$. Their ratio is the probability of transferring to $s'$ after executing $a$ in state $s$.

To avoid a zero denominator we need to smooth: if the denominator is 0, i.e. the sample never executes $a$ in state $s$, we take the transition probabilities to be uniform, $P_{sa}(s') = 1/|S|$.

The estimate above is computed from historical data, and the same formula also supports online updates: when new transfer paths arrive, we simply add the new counts to the numerator and denominator. After this correction the transition probabilities change, new transfer paths may then be generated under the changed probabilities, and the estimates become more and more accurate.

Similarly, if the reward function is unknown, we take $R(s)$ to be the average of the rewards that have been observed in state $s$.
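
The counting estimates above can be written down directly. The sketch below assumes each observed trajectory is stored as a list of (s, a, r, s_next) transitions; this data layout and the function name estimate_model are illustrative assumptions, not from the notes. The uniform 1/|S| fallback implements the smoothing rule, and R(s) is the average observed reward.

```python
import numpy as np

def estimate_model(trajectories, n_states, n_actions):
    trans_count = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    visit_count = np.zeros(n_states)
    for trajectory in trajectories:
        for s, a, r, s_next in trajectory:
            trans_count[s, a, s_next] += 1       # numerator counts
            reward_sum[s] += r                   # for the average reward R(s)
            visit_count[s] += 1
    # P_sa(s') = count(s, a -> s') / count(s, a); uniform 1/|S| if count(s, a) == 0
    sa_totals = trans_count.sum(axis=2, keepdims=True)
    P = np.where(sa_totals > 0,
                 trans_count / np.maximum(sa_totals, 1),
                 1.0 / n_states)
    # R(s) = average observed reward in state s (0 for states never visited)
    R = np.divide(reward_sum, visit_count,
                  out=np.zeros(n_states), where=visit_count > 0)
    return P, R

# Two short made-up trajectories in a 3-state, 2-action MDP.
trajs = [[(0, 1, 0.0, 1), (1, 0, 1.0, 2)],
         [(0, 1, 0.0, 2), (2, 0, -1.0, 2)]]
P_hat, R_hat = estimate_model(trajs, n_states=3, n_actions=2)
print(P_hat[0, 1])   # estimated P_{s=0, a=1}: [0. , 0.5, 0.5]
print(R_hat)         # [ 0.  1. -1.]
```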

Once the transition probabilities and the reward function have been estimated, we can solve the MDP with value iteration or policy iteration. For example, combining parameter estimation with value iteration (when the state transition probabilities are unknown) gives the following procedure:

  1. Randomly initialize the policy $\pi$.
  2. Loop until convergence {
    (a) Use the state transfer counts in the sample to update the estimates of $P_{sa}$ and $R$.
    (b) Use the estimated parameters to update $V$ (with the value iteration method of the previous section).
    (c) Re-derive $\pi$ greedily from the updated $V$, as in formula (3).

    }

In step (b) we perform a value update, which is itself an iterative loop. In the previous section we solved for $V$ by initializing it to 0 and then iterating. If, nested inside the procedure above, we re-initialize $V$ to 0 in every outer iteration and iterate from scratch, convergence will be slow. One way to speed this up is to initialize $V$ to the value obtained in the previous outer iteration, i.e. to tie the starting value of $V$ to the previous result.
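
To illustrate the warm-start point, here is a small self-contained sketch; the random MDP, the perturbation standing in for slightly changed model estimates, and the sweep counts are made up for illustration. Seeding value iteration with the previous $V$ typically needs far fewer sweeps than restarting from zero.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, V0=None, tol=1e-8):
    """Synchronous value iteration; returns (V, number of sweeps until convergence)."""
    V = np.zeros(P.shape[0]) if V0 is None else V0.copy()
    for sweep in range(1, 100001):
        V_new = R + gamma * np.max(P @ V, axis=1)   # Bellman backup, equation (2)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, sweep
        V = V_new

rng = np.random.default_rng(0)
n_s, n_a = 50, 4
P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_s)

V_prev, _ = value_iteration(P, R)                   # first outer iteration: cold start

# Pretend the model estimates shifted slightly after observing more data.
P2 = P + 0.01 * rng.random((n_s, n_a, n_s)); P2 /= P2.sum(axis=2, keepdims=True)

_, cold_sweeps = value_iteration(P2, R)             # restart from V = 0
_, warm_sweeps = value_iteration(P2, R, V0=V_prev)  # warm-start from the previous V
print(cold_sweeps, warm_sweeps)                     # the warm start typically needs fewer sweeps
```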

