Thanks to R. S. Sutton and A. G. Barto for their great work, *Reinforcement Learning: An Introduction*.

Eligibility Traces in Prediction Problems
In the **backward view** of TD($\lambda$), there is an additional memory variable associated with each state: its **eligibility trace**. The eligibility trace for state $s$ at time $t$ is a random variable denoted $Z_t(s) \in \mathbb{R}^+$. On each step, the eligibility traces for **all states** decay by $\gamma\lambda$, and the eligibility trace for the **one state visited** on the step is incremented by 1:
$$Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & s \neq S_t \\ \gamma\lambda Z_{t-1}(s) + 1 & s = S_t \end{cases}$$
for all nonterminal states $s$, where $\gamma$ is the discount rate and $\lambda$ is the $\lambda$-return parameter, or **trace-decay** parameter. This kind of eligibility trace is called an **accumulating trace**. The global TD error signal triggers proportional updates to all recently visited states, as signaled by their nonzero traces:
$$\Delta V_t(s) = \alpha \delta_t Z_t(s), \quad \forall s \in \mathcal{S}$$
where
$$\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$$
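To make the backward view concrete, here is a minimal Python sketch of one episode of tabular TD($\lambda$) with accumulating traces. Only the two update equations above come from the text; the environment interface (`reset`, `step`), the `policy` callable, and all variable names are assumptions for illustration.

```python
import collections

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda), backward view, accumulating traces.

    V: collections.defaultdict(float) mapping states to value estimates,
       updated in place.
    Assumed env interface: reset() -> state,
                           step(action) -> (next_state, reward, done).
    """
    Z = collections.defaultdict(float)   # eligibility traces Z_t(s), start at 0
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))

        # TD error: delta_t = R_{t+1} + gamma * V_t(S_{t+1}) - V_t(S_t);
        # the terminal state's value is taken to be 0.
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]

        # Accumulating trace: all traces decay by gamma*lambda,
        # then the visited state's trace is incremented by 1.
        for state in Z:
            Z[state] *= gamma * lam
        Z[s] += 1.0

        # Proportional update of every state with a nonzero trace:
        # V(s) += alpha * delta_t * Z_t(s)
        for state, z in Z.items():
            V[state] += alpha * delta * z

        s = s_next
    return V
```

Note the order of operations matches the equations: $\delta_t$ is computed from the value estimates before this step's update, the traces are decayed and incremented to obtain $Z_t$, and only then is every state's value adjusted in proportion to its trace.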