First, some concepts
Markov property: the state at the current moment depends only on the state at the previous moment, not on the rest of the history.
The state transition matrix gives the conditional probability of moving from any state A to any state B.
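To make the two definitions above concrete, here is a minimal sketch (the 3-state chain and its transition probabilities are invented for illustration) that stores a transition matrix P, where P[i][j] is the probability of going from state i to state j, and samples a trajectory from it:

```python
import random

# Hypothetical 3-state Markov chain; P[i][j] = P(next = j | current = i).
# Each row is a probability distribution, so it must sum to 1.
P = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def step(state):
    """Sample the next state from row `state` of the transition matrix."""
    return random.choices(range(len(P)), weights=P[state])[0]

def rollout(start, n_steps):
    """Generate a state sequence of length n_steps + 1 starting from `start`.

    Because of the Markov property, each call to step() only needs the
    current state, never the earlier history of the sequence.
    """
    seq = [start]
    for _ in range(n_steps):
        seq.append(step(seq[-1]))
    return seq

random.seed(0)
trajectory = rollout(0, 10)
print(trajectory)
```

Note that `rollout` never looks at anything but the last state: that is exactly the memorylessness described above.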
Markov process (Markov chain): a memoryless stochastic process containing n states.
The Markov reward process (MRP), defined by the tuple (S, P, R, γ), is a Markov chain augmented with rewards.
Use G_t to denote the total return from time t. For reasons of mathematical convenience, and to avoid infinite or undefined returns, a discount factor γ ∈ [0, 1] is introduced to trade off immediate rewards against rewards further in the future. (If every sequence terminates in a finite number of steps under an appropriate policy, γ can also be taken as 1.) G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
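The discounted sum above can be computed back-to-front, which avoids tracking powers of γ explicitly. A small sketch (the reward list is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... for a finite episode.

    Iterating from the last reward backwards, each step folds in one more
    term: g <- r + gamma * g.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1.0 after time t, with gamma = 0.5:
# G_t = 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With γ = 1 this reduces to a plain sum, matching the remark that γ = 1 is acceptable when all sequences terminate in finitely many steps.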
The value function V(s) in a Markov reward process (MRP) represents the expected return obtained starting from state s. It can be estimated from all sample sequences that contain that state. Here R_s is the immediate reward, which can be regarded as the reward for leaving state s. V(s) = E[G_t | S_t = s] = E[R_{t+1} + γ V(S_{t+1}) | S_t = s]
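The phrase "computed by all the sample sequences that contain the state" suggests a Monte Carlo estimate: sample many episodes from state s, compute the discounted return of each, and average. A sketch on a made-up 3-state MRP (the transition matrix, rewards, and terminal state are all illustrative assumptions, not from the text):

```python
import random

# Hypothetical MRP: P is the transition matrix, R[s] is the immediate
# reward for leaving state s, and state 2 is absorbing/terminal.
P = [
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
R = [1.0, 2.0, 0.0]
GAMMA = 0.9
TERMINAL = 2

def sample_episode(start):
    """Return the list of (state, reward) pairs until the terminal state."""
    s, episode = start, []
    while s != TERMINAL:
        episode.append((s, R[s]))
        s = random.choices(range(3), weights=P[s])[0]
    return episode

def mc_value(start, n_episodes=10000):
    """Monte Carlo estimate of V(start): average discounted return G_t."""
    total = 0.0
    for _ in range(n_episodes):
        g, discount = 0.0, 1.0
        for _, r in sample_episode(start):
            g += discount * r
            discount *= GAMMA
        total += g
    return total / n_episodes

random.seed(0)
# Exact value by the Bellman equation above:
# V(0) = R[0] + GAMMA * 0.7 * V(1) = 1 + 0.9 * 0.7 * 2 = 2.26
print(round(mc_value(0), 2))
```

The recursive form V(s) = E[R_{t+1} + γ V(S_{t+1})] is what lets us check the estimate against the exact value in the comment: V(1) = 2 because state 1 deterministically pays 2 and terminates, and V(0) then follows in one step.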