Classical Reinforcement Learning Algorithms, Part 3: The TD Method

1 Preface

In the previous post we analyzed the Monte Carlo method, whose defining feature is that it must run a complete episode to obtain an accurate return. In many scenarios, however, running a full episode is very time-consuming. Can we still follow the path of the Bellman equation and work with an estimated result instead? Note that the setting here is still model-free. The way to do this is the TD (temporal-difference) method.

There is a term to note: bootstrapping. Bootstrapping means using an existing estimate to guide the calculation of a new estimate. Monte Carlo does not bootstrap, while TD does.

Next, let's analyze the TD method in detail.

2 How TD differs from MC

MC uses the exact sampled return as the target for updating the value, while TD uses the Bellman equation to form an estimate of the value, and takes that estimate as the update target.
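For concreteness, here are the two update rules in standard notation (my own addition, following Sutton & Barto, since the original post's equation images are not reproduced here; \alpha is the step size and \gamma the discount factor):

    % Monte Carlo: the target is the actual return G_t, available only at episode end
    V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]

    % TD(0): the target R_{t+1} + \gamma V(S_{t+1}) bootstraps from the current estimate
    V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]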

Different choices of this estimated target value give rise to the various TD algorithms.

So what are the advantages of the TD approach?

    • The value function can be updated at every step, i.e., online learning, so learning is fast
    • It can handle scenarios that never reach a final result, so it applies to a wide range of problems

The disadvantage is also obvious: because the TD target is itself an estimate, it carries estimation error, so the updated value is biased, and unbiased estimation is hard to achieve. On the other hand, the TD target is computed one step at a time, so only the most recent action affects it, whereas an MC return is affected by every action in the episode. The TD target therefore has relatively low variance, i.e., it fluctuates less.

Let's take a look at David Silver's summary:

David Silver's slides contain three pictures that clearly contrast the differences between MC, TD, and DP:


From the above, the differences among the three are very clear. DP is the idealized situation, traversing all successor states; MC is realistic; TD is the most realistic, but also the most inaccurate per update. That is fine, however: with repeated iteration it still converges.

The whole space of reinforcement learning algorithms also falls within the categories above:

3 TD algorithm

What we have described is only the TD(0) estimate; it obviously extends to n steps, that is, the TD target is unrolled further according to the Bellman equation.
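As a minimal sketch of TD(0) policy evaluation (my own illustration, not code from the original post; env is a hypothetical environment whose reset() returns a state and whose step(a) returns (next_state, reward, done)):

    import collections

    def td0_evaluation(env, policy, episodes=500, alpha=0.1, gamma=0.99):
        """Estimate V(s) for a fixed policy with one-step TD updates."""
        V = collections.defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                # The TD target bootstraps from the current estimate V(s_next);
                # a terminal state contributes no future value.
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])
                s = s_next
        return V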

Going further along the same idea, the i-step and j-step returns, TD(i) and TD(j), can be combined by taking their average.

The next step is to include every n-step return at once, each with its own coefficient, with the coefficients summing to 1; this is TD(λ).
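Written out explicitly (my reconstruction in standard notation, since the original equation image is missing; G_t^{(n)} denotes the n-step return), the λ-return weights the n-step returns geometrically so that the coefficients sum to 1:

    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
    \qquad \text{where} \quad (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = 1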

4 Sarsa algorithm

The idea of the Sarsa algorithm is simple: in addition to the reward R and the next state S', also sample the next action A', and use it to estimate Q(S, A).

The algorithm is called Sarsa because each update uses these five quantities: (S, A, R, S', A').
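A minimal sketch of one Sarsa episode (my own illustration under the same hypothetical env interface as above; Q is a collections.defaultdict(float) keyed by (state, action), and actions lists the available actions):

    import random

    def epsilon_greedy(Q, s, actions, eps):
        """Pick argmax_a Q[(s, a)] with probability 1 - eps, else a random action."""
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
        """One episode of on-policy Sarsa, updating Q in place."""
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            # A' is sampled from the very policy being followed (on-policy):
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next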

5 Q-learning algorithm

The famous Q-learning.

Here the update target directly uses the largest Q value over the actions available in the next state.
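For contrast, a sketch of one Q-learning episode under the same assumptions, reusing epsilon_greedy from the Sarsa sketch above; the only substantive change is the target:

    def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
        """One episode of Q-learning, updating Q in place."""
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)  # behavior policy: eps-greedy
            s_next, r, done = env.step(a)
            # The target takes max_a Q(S', a) -- the greedy policy -- regardless
            # of which action the behavior policy actually takes next.
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next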

Why is Sarsa on-policy while Q-learning is off-policy?
Because Sarsa estimates the value of the very policy it is following, whereas Q-learning's target always takes the best (greedy) action's Q value, regardless of which policy generated the behavior.

6 Double Q-learning

Q-learning may over-estimate Q values; Double Q-learning can mitigate this problem:

It maintains two Q functions and updates them alternately: one selects the maximizing action, and the other evaluates it.
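A sketch of a single Double Q-learning update under the same assumptions (my own illustration; Q1 and Q2 are two defaultdict(float) tables):

    import random

    def double_q_step(Q1, Q2, actions, s, a, r, s_next, done,
                      alpha=0.1, gamma=0.99):
        """One Double Q-learning update on the transition (s, a, r, s_next)."""
        # Flip a coin: one table is updated, the other evaluates the target.
        Q_upd, Q_eval = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
        if done:
            target = r
        else:
            # Select the greedy action with the table being updated...
            a_star = max(actions, key=lambda a2: Q_upd[(s_next, a2)])
            # ...but evaluate it with the other table; decoupling selection
            # from evaluation is what curbs the over-estimation bias.
            target = r + gamma * Q_eval[(s_next, a_star)]
        Q_upd[(s, a)] += alpha * (target - Q_upd[(s, a)])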

7 Further comparisons

From the two figures above one can see where the TD, Sarsa, and Q-learning algorithms come from: all are essentially based on the Bellman equation.

One can understand it this way: the Bellman equation is the solution under ideal conditions, and these methods are the achievable variants obtained by giving up some of that ideal accuracy.

Summary

This post has sorted through several TD-related algorithms. The TD algorithms, TD(λ) in particular, lead to the eligibility trace, which will be analyzed in a later post.

Statement

The pictures in this article are captured from:
1 Reinforcement Learning: An Introduction
2 Reinforcement Learning Course by David Silver
