From the previous article we already know that the UCB algorithm can quickly find a reliable move. Can it be optimized further?
First, why does the UCB algorithm converge faster than blind Monte Carlo evaluation? As I understand it, the reason is that while it runs, the UCB algorithm keeps adjusting its policy based on the results so far and decides which candidate points deserve to be evaluated first. In effect, it is an online machine learning strategy. The UCB algorithm is a solution to the multi-armed bandit problem described in the previous article. Compared with plain Monte Carlo evaluation it converges much faster, but there is still room for improvement.
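To make the "adjust the policy from earlier results" idea concrete, here is a minimal sketch of UCB on a plain multi-armed bandit. It assumes the common UCB1 form (average reward plus sqrt(2 ln t / n)); the function names `ucb1_select` and `run_bandit` and the win probabilities are made up for illustration and are not taken from any Go program.

```python
import math
import random

def ucb1_select(counts, totals, t):
    """Pick the arm to pull on round t (1-based), assuming the UCB1 formula."""
    # Pull every arm once before any UCB value is computed.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    def score(arm):
        mean = totals[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(t) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=score)

def run_bandit(win_probs, rounds, seed=0):
    """Simulate slot machines with hypothetical win probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)    # pulls per arm
    totals = [0.0] * len(win_probs)  # summed rewards per arm
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

counts = run_bandit([0.2, 0.5, 0.8], rounds=5000)
print(counts)  # most pulls should go to the last (best) arm
```

Blind Monte Carlo would spread its pulls evenly over all arms; here the pulls drift toward the arm that has looked best so far, while the bonus term keeps the rarely tried arms from being abandoned entirely.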
In the previous article, the candidate points on the Go board were compared to slot machines. But how do the two actually differ?
The answer: a multi-armed bandit has only a single level of slot machines, whereas Go is a game tree made up of many levels of slot machines!
So the game tree finally comes up. The game tree is an essential tool for most board-game programs. Here is a brief introduction to the most basic method of searching it, minimax search (if you have never heard of game trees, please look them up yourself). In a two-player zero-sum game, every decision either side makes is meant to maximize its own benefit (which goes without saying). Suppose we assign a position in which Black wins the positive value V and a position in which White wins the value -V (every other position has a value between V and -V). Black then always tries to make the position value as large as possible, and White tries to make it as small as possible. In the game tree, a level where Black is to move always returns the child with the highest value to the level above, and a level where White is to move returns the child with the lowest value. This is minimax search.
With that background out of the way, here is the answer promised in the previous article: a better algorithm, the UCT algorithm (UCB for Trees). The algorithm is described below:
Given a game tree.
1) Start the search from the root node of the game tree, take it as node A, and execute 2).
2) If node A has a child node that has never been evaluated, execute 3); otherwise, execute 4).
3) Evaluate that child node with the Monte Carlo method to obtain a reward value, then update the average reward of every node on the path from the child node up to the root node. Execute 1).
4) Compute the UCB value of each child node, and take the child node with the highest UCB value as the new node A. Execute 2).
5) The algorithm can be stopped at any point, usually after a set amount of time or a set number of trials.
The child node of the root with the highest average reward is the output of the algorithm.
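The steps above can be sketched in code. To keep the example short and runnable it does not play Go. It uses a hypothetical stand-in game instead: players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins. The names `Node`, `playout` and `uct_search`, the UCB1 formula, and the choice to store each node's reward from the point of view of the player who moved into it are all my own assumptions, not details of any particular Go program.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones    # stones left in the pile at this node
        self.parent = parent
        self.move = move        # stones taken to reach this node
        self.children = []      # children that have been evaluated
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.total = 0.0        # rewards for the player who moved INTO this node

def playout(stones, rng):
    """Random playout: 1.0 if the player to move wins, else 0.0."""
    mover_turn = True
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return 1.0 if mover_turn else 0.0
        mover_turn = not mover_turn
    return 0.0  # no stones left: the previous player already won

def uct_search(root_stones, iterations, seed=0):
    rng = random.Random(seed)
    root = Node(root_stones)
    for _ in range(iterations):           # step 5: fixed number of trials
        node = root                       # step 1: start at the root
        # Steps 2 and 4: while every child is evaluated, follow the best UCB.
        while not node.untried and node.children:
            log_n = math.log(node.visits)
            node = max(
                node.children,
                key=lambda ch: ch.total / ch.visits
                + math.sqrt(2.0 * log_n / ch.visits),
            )
        # Step 3: evaluate one unevaluated child with a random playout.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        reward = 1.0 - playout(node.stones, rng)
        # Update the averages on the path back to the root, flipping the
        # point of view at every level because the players alternate.
        while node is not None:
            node.visits += 1
            node.total += reward
            reward = 1.0 - reward
            node = node.parent
    best = max(root.children, key=lambda ch: ch.total / ch.visits)
    return best.move

print(uct_search(5, iterations=3000))  # taking 1 stone leaves the opponent a lost pile of 4
```

In this toy game a pile that is a multiple of 4 is lost for the player to move, so from 5 stones the search should settle on taking 1. A real Go program would replace the pile with a board position and the playout with a random game played to the end.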
A few points about this algorithm need explaining:
1) The root node of the game tree is the current position.
2) The evaluated nodes and their average rewards are saved and updated by the program while it runs.
3) You can choose whatever reward value suits you. As far as I know, MoGo uses 1 (win) or 0 (loss), while my program foolish go uses the ratio of territory obtained to total territory.
4) This algorithm is the cornerstone of modern Go programs.
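The two reward choices from point 3 might look like this in code. This is only a sketch: the function names are hypothetical, the point counts are invented, and komi is ignored for simplicity.

```python
def win_loss_reward(black_points, white_points):
    """1 for a win, 0 for a loss, seen from Black's side (komi ignored)."""
    return 1.0 if black_points > white_points else 0.0

def territory_reward(black_points, total_points):
    """Fraction of the board that Black ends up with."""
    return black_points / total_points

# A finished 19x19 playout in which Black holds 185 of the 361 points.
print(win_loss_reward(185, 176))
print(territory_reward(185, 361))
```

The first reward only says who won the playout, while the second also reflects the margin.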
My personal understanding is that this algorithm is essentially an iterative depth-first search.
To get a better feel for how the UCT algorithm converges, try thinking about this question: how does the algorithm come to "recognize" which move it should look for on the next turn?
That is about it for the theory. The next article moves on to practice.