The previous article introduced the Monte Carlo situation evaluation algorithm. How can this algorithm be used to implement a Go-playing program? The most obvious approach is, for a given position, to run the Monte Carlo evaluation on the position that results from each candidate move, and then select the best move. This is feasible, but can it be optimized?
Suppose you are a CPU. You know the rules of Go, but you have none of the higher-level Go knowledge. What you can do is simulate random games. Facing a huge board, how do you choose a move? You would run 10,000 simulated evaluations for each candidate move, one move after another, and finally compare the evaluation values to pick the most advantageous one...
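The brute-force approach described above can be sketched as follows. This is only a minimal illustration: the move names "A", "B", "C", the hidden win probabilities, and the `toy_playout` function are all hypothetical stand-ins for a real random Go playout, which is not shown here.

```python
import random

def flat_monte_carlo(moves, playout, sims_per_move=10_000):
    """Give every candidate move the same number of random playouts,
    then pick the move with the highest observed win rate."""
    win_rate = {}
    for move in moves:
        wins = sum(playout(move) for _ in range(sims_per_move))
        win_rate[move] = wins / sims_per_move
    best = max(win_rate, key=win_rate.get)
    return best, win_rate

# Hypothetical stand-in for a random playout: each move has a hidden
# win probability that the evaluator is not allowed to look at.
HIDDEN_WIN_PROB = {"A": 0.40, "B": 0.55, "C": 0.50}
_rng = random.Random(0)

def toy_playout(move):
    """Return 1 for a simulated win after playing `move`, else 0."""
    return 1 if _rng.random() < HIDDEN_WIN_PROB[move] else 0

best, rates = flat_monte_carlo(list(HIDDEN_WIN_PROB), toy_playout)
print(best, rates)
```

Note that the clearly inferior move "A" receives exactly as many simulations as the two close contenders, which is the waste the rest of this article tries to avoid.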
Let's set Go aside for a moment and introduce the "multi-armed bandit problem". The problem is described as follows: a multi-armed bandit can be pictured as a row of slot machines in a casino. Each machine has an unknown rate of return, and the return rates of different machines are independent of each other. Given a limited number of attempts, how do you get the highest total return from these machines? This is a classic model of the trade-off between exploration and exploitation in machine learning, and it has been studied carefully in statistics.
After trying a few machines, we are naturally inclined to keep playing the ones that have paid well so far. However, this confines us to existing experience at the expense of exploration, and we are likely to miss machines with higher returns. So we should also try the machines that have had fewer attempts, to obtain more accurate information about them.
The UCB algorithm tries to balance these two behaviours. It takes the current average reward of a slot machine as a base value and adds an adjustment value to it; the sum is the machine's UCB value. Each time, the machine with the largest UCB value is tried, and a machine's adjustment value shrinks as the number of attempts on that machine grows. In its standard form (UCB1), the UCB value is calculated as follows:

$$\mathrm{UCB}_j = \bar{X}_j + \sqrt{\frac{2 \ln n}{T_j(n)}}$$

Here $\bar{X}_j$ is the current average reward of machine $j$, $n$ is the total number of attempts on all machines so far, and $T_j(n)$ is the number of attempts on machine $j$.
The term to the right of the plus sign is the adjustment value of the UCB algorithm. It is easy to see that the fewer times a machine has been tried, the larger its adjustment value, and the more likely that machine is to be tried next.
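To make the effect of the adjustment value concrete, here is a small sketch. It assumes the standard UCB1 form, average reward plus sqrt(2 ln n / T_j(n)); the function name `ucb1` and the example numbers are made up for illustration.

```python
import math

def ucb1(mean_reward, total_tries, machine_tries):
    """UCB1 value of one machine: its average reward so far plus an
    adjustment value that shrinks as the machine is tried more often."""
    adjustment = math.sqrt(2.0 * math.log(total_tries) / machine_tries)
    return mean_reward + adjustment

# Two machines with the same average reward after 100 attempts in total,
# but one has been tried only 5 times and the other 95 times.
rarely_tried = ucb1(mean_reward=0.5, total_tries=100, machine_tries=5)
often_tried = ucb1(mean_reward=0.5, total_tries=100, machine_tries=95)
print(rarely_tried, often_tried)
```

Although both averages are equal, the rarely tried machine gets the larger UCB value, so it would be chosen next.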
As for how this formula was derived, that is beyond me... As engineers, we can simply use the scientists' results ~
Now back to Go, and back to the earlier assumption: you are a CPU. How do you choose a move? Using the UCB algorithm is certainly a good answer! Treat each candidate move in the current position as a slot machine. Each time, run one Monte Carlo evaluation on the move with the largest UCB value; this quickly finds a reliable move.
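A minimal sketch of this idea follows. It assumes the standard UCB1 formula, and it picks the most frequently tried move at the end, which is a common convention rather than something this article prescribes. The move names, the hidden win probabilities, and `toy_playout` are hypothetical stand-ins for real random Go playouts.

```python
import math
import random

def ucb_select_move(moves, playout, total_playouts):
    """Spend a fixed budget of playouts, always simulating the move
    whose UCB1 value is currently the largest."""
    wins = {m: 0 for m in moves}
    tries = {m: 0 for m in moves}

    # Try every move once so that each UCB value is defined.
    for m in moves:
        wins[m] += playout(m)
        tries[m] += 1

    # `total` is the number of playouts made so far across all moves.
    for total in range(len(moves), total_playouts):
        def ucb(m):
            return wins[m] / tries[m] + math.sqrt(2.0 * math.log(total) / tries[m])
        move = max(moves, key=ucb)
        wins[move] += playout(move)
        tries[move] += 1

    # Assumed convention: the most frequently tried move is the answer.
    best = max(moves, key=lambda m: tries[m])
    return best, tries

# Hypothetical stand-in for a random playout from each candidate move.
HIDDEN_WIN_PROB = {"A": 0.30, "B": 0.70, "C": 0.40}
_rng = random.Random(0)

def toy_playout(move):
    """Return 1 for a simulated win after playing `move`, else 0."""
    return 1 if _rng.random() < HIDDEN_WIN_PROB[move] else 0

best, tries = ucb_select_move(list(HIDDEN_WIN_PROB), toy_playout, total_playouts=5000)
print(best, tries)
```

In this toy setting most of the 5000 playouts end up on the strongest move, while the weaker moves are tried just often enough to confirm that they are weaker.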
Wait, are there repeated computations here? Can this be optimized further?