(i) ε-greedy algorithm
The ε-greedy algorithm strikes a compromise between exploration and exploitation based on a probability ε: on each trial, with probability ε it explores, i.e. selects an arm uniformly at random; with probability 1 - ε it exploits, i.e. selects the arm with the highest current average reward (if several arms tie, one of them is selected at random).
Notation: lowercase k indexes an arm (k = 1, 2, ..., K), uppercase K is the total number of arms, n is the number of trials, and v_n is the reward obtained on the n-th trial. The average reward of arm k over n trials is

Q(k) = (v_1 + v_2 + ... + v_n) / n.

The intuitive meaning of the incremental form is: Q_{n-1}(k) is the average reward of the first n-1 trials; multiplying it by n-1 recovers the total reward of the first n-1 trials; adding the n-th reward v_n and dividing by n gives the average reward over all n trials:

Q_n(k) = ((n-1) * Q_{n-1}(k) + v_n) / n.

In the pseudocode, argmax_i Q(i) selects the arm with the best current Q(i). count(k) starts from 0, so count(k) + 1 equals n, and the updated Q(k) = (Q(k) * count(k) + v) / (count(k) + 1) is exactly the average reward over n trials.
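To make this concrete, here is a minimal Python sketch of ε-greedy using the incremental average update above. The function name epsilon_greedy and the reward callback R(k) are assumptions for illustration, not from the original text:

```python
import random

def epsilon_greedy(K, R, T, epsilon=0.1):
    """Play a K-armed bandit for T rounds with the epsilon-greedy policy.

    R(k) is a caller-supplied function returning a sampled reward for arm k.
    """
    Q = [0.0] * K        # current average reward per arm
    count = [0] * K      # number of pulls per arm
    total_reward = 0.0
    for _ in range(T):
        if random.random() < epsilon:
            k = random.randrange(K)      # explore: uniform random arm
        else:
            best = max(Q)                # exploit: best arm, ties broken randomly
            k = random.choice([i for i in range(K) if Q[i] == best])
        v = R(k)                         # observe the reward
        # incremental average: Q_n = ((n-1)*Q_{n-1} + v_n) / n, with n = count[k] + 1
        Q[k] = (Q[k] * count[k] + v) / (count[k] + 1)
        count[k] += 1
        total_reward += v
    return total_reward, Q
```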
(ii) Softmax algorithm
The Softmax algorithm makes the compromise between exploration and exploitation based on the currently known average rewards of the arms. If every arm's average reward is equal, each arm is selected with equal probability; if some arms' average rewards are clearly higher than the others', their probability of being selected is also clearly higher.
In the ε-greedy algorithm, the value of ε is set by the user. In the Softmax algorithm, the probability distribution over the arms is based on the Boltzmann distribution:

P(k) = e^{Q(k)/τ} / Σ_{i=1}^{K} e^{Q(i)/τ}

where Q(k) is the current average reward of arm k and τ > 0 is the temperature. A small τ concentrates the probability on the arms with high average rewards (more exploitation), while a large τ pushes the distribution toward uniform (more exploration).
(At first glance it may not be obvious where the algorithm uses the Boltzmann distribution; it enters in the step that randomly samples which arm to play.)
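Likewise, a minimal Python sketch of the Softmax policy, sampling each arm with probability proportional to e^{Q(k)/τ}; subtracting max(Q) before exponentiating is a standard numerical-stability trick not mentioned in the original, and the name softmax_bandit and its parameters are hypothetical:

```python
import math
import random

def softmax_bandit(K, R, T, tau=0.1):
    """Play a K-armed bandit for T rounds, sampling arms from the Boltzmann distribution."""
    Q = [0.0] * K        # current average reward per arm
    count = [0] * K      # number of pulls per arm
    total_reward = 0.0
    for _ in range(T):
        # P(k) proportional to exp(Q(k)/tau); subtract max(Q) for numerical stability
        m = max(Q)
        weights = [math.exp((q - m) / tau) for q in Q]
        k = random.choices(range(K), weights=weights)[0]   # sample an arm
        v = R(k)                                           # observe the reward
        Q[k] = (Q[k] * count[k] + v) / (count[k] + 1)      # incremental average update
        count[k] += 1
        total_reward += v
    return total_reward, Q
```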
Which of the two algorithms works better depends on the actual application. As the Softmax comparison shows, when the temperature τ = 0.01 the curve almost coincides with the "exploitation-only" curve; this is expected, since as τ approaches 0 Softmax puts nearly all of its probability on the arm with the highest average reward.
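To compare the two policies concretely, here is a small demo assuming the epsilon_greedy and softmax_bandit sketches above, with hypothetical win probabilities of 0.4 and 0.2 for a 2-armed Bernoulli bandit (these numbers are illustrative, not from the original text):

```python
import random

# Hypothetical 2-armed Bernoulli bandit: arm 0 pays 1 with prob. 0.4, arm 1 with prob. 0.2.
def R(k):
    return 1.0 if random.random() < (0.4, 0.2)[k] else 0.0

for eps in (0.01, 0.1):
    reward, _ = epsilon_greedy(K=2, R=R, T=1000, epsilon=eps)
    print(f"epsilon-greedy eps={eps}: total reward {reward:.0f}")

for tau in (0.01, 0.1):
    reward, _ = softmax_bandit(K=2, R=R, T=1000, tau=tau)
    print(f"Softmax tau={tau}: total reward {reward:.0f}")
```

With τ = 0.01 the Softmax run behaves almost identically to always pulling the currently best arm, matching the observation above.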