1 Introduction
The AlphaGo Zero pipeline (hereinafter "Zero") is shown in Figures A and B. In each state s, an MCTS search produces a probability p for every possible move; the search runs during self-play and is guided by the network fθ. fθ is built mainly on ResNet, the residual-learning architecture from Microsoft Research. Once MCTS has produced the move probabilities p, they are used to update the weights of fθ. Finally, fθ is also used to estimate the probability of eventually winning the game from the current position.
2 MCTS
Each node s (a state) has several edges, one for each action a ∈ A(s) that can be played in state s. Each edge stores the following statistics:
{N(s, a), W(s, a), Q(s, a), P(s, a)}
where N(s, a) is the visit count, W(s, a) is the total action value accumulated from taking action a, Q(s, a) is the mean action value of action a, and P(s, a) is the prior probability given by fθ.
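The per-edge statistics can be sketched as a small data structure. This is only an illustration: the names Edge, Node, prior, visits, total_value and mean_value are hypothetical, and Q is derived from W and N on demand instead of being stored, which is a design choice of this sketch and not something the text prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Statistics for one (state, action) pair; names are illustrative."""
    prior: float              # P(s, a): prior probability from the network
    visits: int = 0           # N(s, a): visit count
    total_value: float = 0.0  # W(s, a): total action value

    @property
    def mean_value(self) -> float:
        """Q(s, a) = W(s, a) / N(s, a); defined as 0 for an unvisited edge."""
        return self.total_value / self.visits if self.visits else 0.0


@dataclass
class Node:
    """A state s with one outgoing edge per legal action."""
    edges: dict = field(default_factory=dict)  # action -> Edge


# Example: a node with two legal moves and made-up priors.
root = Node(edges={"a": Edge(prior=0.6), "b": Edge(prior=0.4)})
root.edges["a"].visits += 1
root.edges["a"].total_value += 0.5
```

Deriving Q from W and N keeps the three numbers from drifting out of sync; storing Q explicitly, as the edge tuple above lists it, works equally well.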
(1) Select
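Starting from the root node, the search descends until it reaches a leaf node s_L. At each step t along the way, the action is chosen by:
a_t = argmax_a [Q(s_t, a) + U(s_t, a)]
where
U(s, a) = c_puct · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a))
c_puct is a constant that controls the degree of exploration.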
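As a rough illustration of the selection step, the sketch below assumes the PUCT rule from the AlphaGo Zero paper, a = argmax_a [Q(s, a) + c_puct · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a))]. The function name puct_select, the list-based layout and the default c_puct = 1.0 are assumptions made for this example; the text does not give a value for the constant.

```python
import math


def puct_select(visits, mean_values, priors, c_puct=1.0):
    """Return the index of the action that maximises Q + U.

    visits[a] is N(s, a), mean_values[a] is Q(s, a), priors[a] is P(s, a).
    c_puct=1.0 is an arbitrary default chosen for this sketch.
    """
    total_visits = sum(visits)
    best_action, best_score = None, -math.inf
    for action, (n, q, p) in enumerate(zip(visits, mean_values, priors)):
        # Exploration bonus: large for high-prior, rarely visited actions.
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action


# Made-up statistics for three actions.
visits = [10, 1, 0]
mean_values = [0.5, 0.2, 0.0]
priors = [0.5, 0.3, 0.2]
print(puct_select(visits, mean_values, priors, c_puct=1.0))  # prints 1
```

With c_puct = 0 the same statistics select action 0 (pure exploitation of Q), while a large value such as 5 selects the unvisited action 2, which shows how the constant trades exploitation against exploration.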
(2) Expand and evaluate
When a leaf node s_L is expanded, its position is evaluated by fθ after a random reflection or rotation of the board, and each new edge is initialized to:
{N(s_L, a) = 0, W(s_L, a) = 0, Q(s_L, a) = 0, P(s_L, a) = p_a}
The value v of the leaf is then propagated back along every traversed edge (s, a): N(s, a) = N(s, a) + 1, W(s, a) = W(s, a) + v, Q(s, a) = W(s, a) / N(s, a).
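The initialization and update rules for this step can be sketched as two small functions. The names expand and backup and the dict-based edge layout are hypothetical, and the sketch simplifies in two ways: the priors and the value v are passed in as plain numbers instead of coming from fθ, and v is not re-signed as the player to move alternates along the path.

```python
def expand(node, priors):
    """Add one edge per legal action with N = 0, W = 0, Q = 0, P = p_a.

    node is a dict mapping action -> edge statistics;
    priors maps action -> p_a (here supplied by the caller).
    """
    for action, p in priors.items():
        node[action] = {"N": 0, "W": 0.0, "Q": 0.0, "P": p}


def backup(path, value):
    """Update every traversed edge: N += 1, W += v, Q = W / N."""
    for edge in path:
        edge["N"] += 1
        edge["W"] += value
        edge["Q"] = edge["W"] / edge["N"]


# Example with made-up priors and leaf values.
leaf = {}
expand(leaf, {"a": 0.6, "b": 0.4})
path = [leaf["a"]]   # edges traversed in one simulation
backup(path, 0.5)    # first simulation returned v = 0.5
backup(path, -1.0)   # second simulation returned v = -1.0
print(leaf["a"])     # {'N': 2, 'W': -0.5, 'Q': -0.25, 'P': 0.6}
```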
(3) Pruning
If the Q values of a node and of its best child node are both below a certain threshold, the node is pruned and no longer explored.
3 Neural Network Architecture
Input features: history features + the player's own current features + the opponent's current features.
The neural network uses the ResNet structure from Microsoft Research, built from two-layer residual blocks.
Zero uses a ResNet with either 19 or 39 residual blocks.
Two-layer residual block of ResNet
The "weight layer" in ResNet's two-layer residual block corresponds to the convolutional block in Zero. The convolutional block consists of:
(1) A convolution of 256 filters of kernel size 3x3 with stride 1
(2) Batch Normalization
(3) A rectifier nonlinearity
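The three steps of the convolutional block can be sketched in plain Python on nested lists. This is a toy illustration, not an efficient or authoritative implementation: the function names are hypothetical, zero padding of 1 is assumed so that the board size is preserved, batch normalization is shown without learned scale and shift, and the example uses 4 filters on a 5x5 board instead of 256 filters on a full board to keep it fast.

```python
import math
import random


def conv3x3(x, weights):
    """3x3 convolution, stride 1, zero padding 1 (padding is an assumption).

    x: [C_in][H][W]; weights: [C_out][C_in][3][3]; returns [C_out][H][W].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    out = []
    for filt in weights:
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                total = 0.0
                for c in range(c_in):
                    for di in range(3):
                        for dj in range(3):
                            ii, jj = i + di - 1, j + dj - 1
                            if 0 <= ii < h and 0 <= jj < w:
                                total += filt[c][di][dj] * x[c][ii][jj]
                plane[i][j] = total
        out.append(plane)
    return out


def batch_norm(batch, eps=1e-5):
    """Normalise each channel over the batch and spatial positions.

    batch: [N][C][H][W]. Learned scale and shift are omitted in this sketch.
    """
    n_channels = len(batch[0])
    out = [[None] * n_channels for _ in batch]
    for c in range(n_channels):
        values = [v for sample in batch for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        for n, sample in enumerate(batch):
            out[n][c] = [[(v - mean) * scale for v in row] for row in sample[c]]
    return out


def relu(batch):
    """Rectifier nonlinearity applied element-wise."""
    return [[[[max(0.0, v) for v in row] for row in plane]
             for plane in sample] for sample in batch]


def conv_block(batch, weights):
    """(1) convolution, (2) batch normalization, (3) rectifier."""
    return relu(batch_norm([conv3x3(x, weights) for x in batch]))


# Toy example: 2 positions, 3 input planes, 5x5 board, 4 filters.
random.seed(0)
weights = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
            for _ in range(3)] for _ in range(4)]
batch = [[[[random.random() for _ in range(5)] for _ in range(5)]
          for _ in range(3)] for _ in range(2)]
out = conv_block(batch, weights)
print(len(out), len(out[0]), len(out[0][0]), len(out[0][0][0]))  # prints 2 4 5 5
```

In practice this block would be written with a deep-learning framework; the nested loops here only make the order of operations and the tensor shapes explicit.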