Classical Reinforcement Learning Algorithms, Part 3: The TD Method

1 Preface

In the previous post we analyzed the Monte Carlo method, whose defining feature is that it must run a complete episode to obtain an accurate return. In many scenarios, however, running a full episode is very time-consuming. Can we still follow the path of the Bellman equation and work with an estimated result instead? Note that the setting here is still model-free. The way to do this is the TD (temporal-difference) method.

There is a term to note: bootstrapping. Bootstrapping means using an existing estimate to guide the calculation of a new estimate. Monte Carlo does not bootstrap, while TD does.

Next, let's analyze the TD method in detail.

2 How TD differs from MC

MC uses the exact sampled return as the target for updating the value, while TD uses the Bellman equation to form an estimate of the value, and takes that estimate as the update target.
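For concreteness, here are the two update rules in standard notation (my own addition, following Sutton & Barto, since the original post's equation images are not reproduced here; \alpha is the step size and \gamma the discount factor):

    % Monte Carlo: the target is the actual return G_t, available only at episode end
    V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]

    % TD(0): the target R_{t+1} + \gamma V(S_{t+1}) bootstraps from the current estimate
    V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]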

Different choices of this estimated target value give rise to the various TD algorithms.

So what are the advantages of the TD approach?

    • The value function can be updated at every step, i.e., online learning, so learning is fast
    • It can handle scenarios that never reach a final result, so it applies to a wide range of problems

The disadvantage is also obvious: because the TD target is itself an estimate, it carries estimation error, so the updated value is biased, and unbiased estimation is hard to achieve. On the other hand, the TD target is computed one step at a time, so only the most recent action affects it, whereas an MC return is affected by every action in the episode. The TD target therefore has relatively low variance, i.e., it fluctuates less.

Let's take a look at David Silver's summary:

David Silver's slides contain three pictures that clearly contrast the differences between MC, TD, and DP:


From the above, the differences among the three are very clear. DP is the idealized situation, traversing all successor states; MC is realistic; TD is the most realistic, but also the most inaccurate per update. That is fine, however: with repeated iteration it still converges.

The whole space of reinforcement learning algorithms also falls within the categories above:

3 TD algorithm

What we have described is only the TD(0) estimate; it obviously extends to n steps, that is, the TD target is unrolled further according to the Bellman equation.
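As a minimal sketch of TD(0) policy evaluation (my own illustration, not code from the original post; env is a hypothetical environment whose reset() returns a state and whose step(a) returns (next_state, reward, done)):

    import collections

    def td0_evaluation(env, policy, episodes=500, alpha=0.1, gamma=0.99):
        """Estimate V(s) for a fixed policy with one-step TD updates."""
        V = collections.defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                # The TD target bootstraps from the current estimate V(s_next);
                # a terminal state contributes no future value.
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])
                s = s_next
        return V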

Going further along the same idea, the i-step and j-step returns, TD(i) and TD(j), can be combined by taking their average.

The next step is to include every n-step return at once, each with its own coefficient, with the coefficients summing to 1; this is TD(λ).
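Written out explicitly (my reconstruction in standard notation, since the original equation image is missing; G_t^{(n)} denotes the n-step return), the λ-return weights the n-step returns geometrically so that the coefficients sum to 1:

    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
    \qquad \text{where} \quad (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = 1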

4 Sarsa algorithm

The idea of the Sarsa algorithm is simple: in addition to the reward R and the next state S', also sample the next action A', and use it to estimate Q(S, A).

The algorithm is called Sarsa because each update uses these five quantities: (S, A, R, S', A').
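A minimal sketch of one Sarsa episode (my own illustration under the same hypothetical env interface as above; Q is a collections.defaultdict(float) keyed by (state, action), and actions lists the available actions):

    import random

    def epsilon_greedy(Q, s, actions, eps):
        """Pick argmax_a Q[(s, a)] with probability 1 - eps, else a random action."""
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
        """One episode of on-policy Sarsa, updating Q in place."""
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            # A' is sampled from the very policy being followed (on-policy):
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next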

5 Q-learning algorithm

The famous Q-learning.

Here the update target directly uses the largest Q value over the actions available in the next state.
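For contrast, a sketch of one Q-learning episode under the same assumptions, reusing epsilon_greedy from the Sarsa sketch above; the only substantive change is the target:

    def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
        """One episode of Q-learning, updating Q in place."""
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)  # behavior policy: eps-greedy
            s_next, r, done = env.step(a)
            # The target takes max_a Q(S', a) -- the greedy policy -- regardless
            # of which action the behavior policy actually takes next.
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next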

Why is Sarsa on-policy while Q-learning is off-policy?
Because Sarsa estimates the value of the very policy it is following, whereas Q-learning's target always takes the best (greedy) action's Q value, regardless of which policy generated the behavior.

6 Double Q-learning

Q-learning may over-estimate Q values; Double Q-learning can mitigate this problem:

It maintains two Q functions and updates them alternately: one selects the maximizing action, and the other evaluates it.
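A sketch of a single Double Q-learning update under the same assumptions (my own illustration; Q1 and Q2 are two defaultdict(float) tables):

    import random

    def double_q_step(Q1, Q2, actions, s, a, r, s_next, done,
                      alpha=0.1, gamma=0.99):
        """One Double Q-learning update on the transition (s, a, r, s_next)."""
        # Flip a coin: one table is updated, the other evaluates the target.
        Q_upd, Q_eval = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
        if done:
            target = r
        else:
            # Select the greedy action with the table being updated...
            a_star = max(actions, key=lambda a2: Q_upd[(s_next, a2)])
            # ...but evaluate it with the other table; decoupling selection
            # from evaluation is what curbs the over-estimation bias.
            target = r + gamma * Q_eval[(s_next, a_star)]
        Q_upd[(s, a)] += alpha * (target - Q_upd[(s, a)])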

7 Further comparisons

From the two figures above one can see where the TD, Sarsa, and Q-learning algorithms come from: all are essentially based on the Bellman equation.

One can understand it this way: the Bellman equation is the solution under ideal conditions, and these methods are the achievable variants obtained by giving up some of that ideal accuracy.

Summary

This post has sorted through several TD-related algorithms. The TD algorithms, TD(λ) in particular, lead to the eligibility trace, which will be analyzed in a later post.

Statement

The pictures in this article are captured from:
1 Reinforcement Learning: An Introduction
2 Reinforcement Learning Course by David Silver
