AlphaGo Zero paper (translated): Mastering the game of Go without human knowledge (deep learning)

Source: Internet
Author: User
Reposted from: http://blog.csdn.net/lpjishu/article/details/78291152
Topic: Nature paper

Mastering the game of Go without human knowledge

Authors

David Silver1*, Julian Schrittwieser1*, Karen Simonyan1*, Ioannis Antonoglou1, Aja Huang1, Arthur Guez1, Thomas Hubert1, Lucas Baker1, Matthew Lai1, Adrian Bolton1, Yutian Chen1, Timothy Lillicrap1, Fan Hui1, Laurent Sifre1, George van den Driessche1, Thore Graepel1 & Demis Hassabis1

Summary

A long-standing goal of artificial intelligence has been an algorithm that learns, starting as a complete novice, to reach super-expert level in challenging domains. Recently, AlphaGo became the first program to defeat a world champion at the game of Go. AlphaGo used neural networks to evaluate board positions and to select moves within a tree search. These networks were first trained by supervised learning on the moves of strong human players, and then refined by reinforcement learning from self-play. In this paper we introduce an algorithm based solely on reinforcement learning, without human game data, guidance, or any domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move choices and the winners of its games. This network in turn improves the strength of the tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting as a complete novice, our new program, AlphaGo Zero, reached super-expert level and beat the previously developed AlphaGo (the version that played Lee Sedol) by 100 games to 0.

Introduction

Supervised learning that reproduces the decisions of human experts has driven much progress in artificial intelligence. However, expert data are often expensive, unreliable, or simply unavailable, and even when reliable data can be obtained, the performance of systems trained in this way may be capped by the quality of that data [5]. By contrast, reinforcement learning systems are trained from their own experience, so in principle they can exceed human ability and operate in domains where human expertise is lacking. In recent years, deep neural networks trained by reinforcement learning have made rapid progress in this direction. Such systems have surpassed human players in video games such as Atari [6,7] and in 3D virtual environments [8,9,10]. However, the games most demanding of human intellect, such as Go, have long been regarded as a grand challenge for AI: they require precise and sophisticated lookahead (reading many moves ahead) across an enormous search space, and general-purpose methods had not previously reached the level of strong human players in these domains.

AlphaGo was the first program to reach super-expert human level at Go. Our first version, AlphaGo Fan, defeated the European Go champion Fan Hui (樊麾, head coach of the French national Go team) in October 2015. AlphaGo used two deep neural networks: a policy network that outputs a probability distribution over the next move, and a value network that evaluates board positions. The policy network was first trained by supervised learning to accurately predict the moves of strong players, and was then refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte Carlo tree search (MCTS) to provide lookahead: the policy network narrowed the search to high-probability moves, and the value network (combined with fast Monte Carlo rollouts) evaluated positions in the tree. A later version, which we call AlphaGo Lee, used a similar approach and defeated Lee Sedol, winner of 18 international titles, in 2016.

Our current program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important ways. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it uses only the black and white stones on the board as input features (earlier versions of AlphaGo relied on many hand-crafted features). Third, it uses a single neural network rather than separate policy and value networks. Finally, it uses a simpler tree search that relies on this single network to evaluate positions and move probabilities, without performing any Monte Carlo rollouts. To achieve these results, we developed a reinforcement learning algorithm that incorporates lookahead search inside the training loop, yielding rapid improvement and precise, stable learning. Further differences in the search algorithm, training procedure, and network architecture are described in the Methods section.

Reinforcement learning in AlphaGo Zero

Our new method uses a deep neural network f_θ with parameters θ. The network takes as input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = f_θ(s). The vector p of move probabilities gives the probability of selecting each move a (including pass), p_a = Pr(a | s). The value v is a scalar estimate of the probability that the current player will win from position s. This network combines the roles of the policy network and the value network [12] in a single architecture, consisting of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).
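For illustration, the following is a minimal PyTorch sketch of such a dual-headed residual network: a convolutional trunk of residual blocks with batch normalization and ReLU, a policy head producing move logits, and a value head producing a scalar in [-1, 1]. The specific sizes (17 input planes, 256 filters, a 19x19 board, 20 blocks) are assumptions chosen for the example, not details given in this text.

# Sketch of a dual-headed residual policy/value network (assumed sizes).
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19          # assumed board size
IN_PLANES = 17      # assumed number of input feature planes
FILTERS = 256       # assumed number of convolutional filters

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b2 = nn.BatchNorm2d(ch)
    def forward(self, x):
        y = F.relu(self.b1(self.c1(x)))
        y = self.b2(self.c2(y))
        return F.relu(x + y)            # residual (skip) connection

class PolicyValueNet(nn.Module):
    def __init__(self, blocks=20):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, FILTERS, 3, padding=1, bias=False),
            nn.BatchNorm2d(FILTERS), nn.ReLU())
        self.trunk = nn.Sequential(*[ResBlock(FILTERS) for _ in range(blocks)])
        # policy head: logits over BOARD*BOARD moves plus pass
        self.p_head = nn.Sequential(
            nn.Conv2d(FILTERS, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))
        # value head: scalar estimate of the current player's chance of winning
        self.v_head = nn.Sequential(
            nn.Conv2d(FILTERS, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(BOARD * BOARD, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh())
    def forward(self, s):
        x = self.trunk(self.stem(s))
        logits = self.p_head(x)             # softmax of these logits gives p
        v = self.v_head(x).squeeze(-1)      # scalar value v
        return logits, v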

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network f_θ. The MCTS search outputs probabilities π for playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network f_θ(s); MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search, using the improved MCTS-based policy to select each move and then using the game winner z as a sample of the value, may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a procedure of policy iteration.

Figure 1a | Self-play reinforcement learning in AlphaGo Zero.

The program plays a game s_1, ..., s_T against itself. In each position s_t, an MCTS search α_θ is executed using the latest neural network f_θ (see Figure 2). Moves are selected according to the search probabilities computed by the MCTS, a_t ~ π_t. The terminal position s_T is scored according to the rules of the game to compute the game winner z.

Figure 1b | Neural network training in AlphaGo Zero.

The neural network takes the raw board position s_t as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector p_t, representing a probability distribution over moves, and a scalar value v_t, representing the probability of the current player winning in position s_t. The neural network parameters θ are updated to maximize the similarity of the policy vector p_t to the search probabilities π_t, and to minimize the error between the predicted winner v_t and the game winner z (see equation (1)). The new parameters are used in the next iteration of self-play.

This gives a procedure of policy iteration [22, 23]: the neural network parameters are updated to make the move probabilities and value (p, v) = f_θ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are then used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.
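As an illustration of how self-play produces the training data described here, the following Python sketch records (s_t, π_t, z_t) tuples from one game. The mcts_search callable and the minimal game interface (is_over, encode, to_play, play, winner) are hypothetical placeholders, not the authors' implementation.

# Sketch: one self-play game yields (state, search probabilities, winner) tuples.
import random

def self_play_game(net, mcts_search, game):
    """Play one game with the current network; return (s, pi, z) training tuples."""
    history = []                                   # (encoded position, pi, player to move)
    while not game.is_over():
        pi = mcts_search(net, game)                # visit-count distribution over moves
        history.append((game.encode(), pi, game.to_play()))
        move = random.choices(range(len(pi)), weights=pi)[0]   # sample a_t ~ pi_t
        game.play(move)
    winner = game.winner()                         # +1 if the first player won, -1 otherwise
    # z_t is the final outcome seen from the player to move at step t (to_play in {+1, -1})
    return [(s, pi, winner * player) for (s, pi, player) in history]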
The MCTS uses the neural network f_θ to guide its simulations (see Figure 2).

Each edge (s, a) of the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action value Q(s, a). Each simulation starts from the root state and iteratively selects the move that maximizes an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a) / (1 + N(s, a)) (refs 12, 24), until a leaf node s' is encountered. The leaf position is expanded and evaluated only once by the network to produce both its prior probabilities and its evaluation, (P(s', ·), V(s')) = f_θ(s'). Each edge (s, a) traversed in the simulation is then updated to increment its visit count N(s, a) and to update its action value to the mean evaluation over these simulations, Q(s, a) = (1/N(s, a)) Σ_{s' | s,a→s'} V(s'), where s, a → s' indicates that a simulation eventually reached s' after taking move a from position s.
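A minimal Python sketch of these per-edge statistics and the select/backup rules follows. It assumes the usual PUCT form of U(s, a), including a sqrt term over the parent's total visits and an exploration constant c_puct, neither of which is specified in this text.

# Sketch: edge statistics, PUCT selection, and value backup for the tree search.
import math

class Edge:
    """Statistics stored on each edge (s, a) of the search tree."""
    def __init__(self, prior):
        self.P = prior      # prior probability P(s, a) from the network
        self.N = 0          # visit count N(s, a)
        self.W = 0.0        # total action value
        self.Q = 0.0        # mean action value Q(s, a) = W / N

def select_action(edges, c_puct=1.0):
    """Return the action maximising Q(s, a) + U(s, a) at one node."""
    total_visits = sum(e.N for e in edges.values())
    def score(a):
        e = edges[a]
        u = c_puct * e.P * math.sqrt(total_visits + 1) / (1 + e.N)   # U(s, a)
        return e.Q + u
    return max(edges, key=score)

def backup(path, leaf_value):
    """Propagate the leaf evaluation V(s') up the traversed edges.
    `path` holds (edge, sign) pairs; the sign flips so each edge is credited
    from the perspective of the player to move at that edge (an assumption)."""
    for edge, sign in path:
        edge.N += 1
        edge.W += sign * leaf_value
        edge.Q = edge.W / edge.N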
MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count of each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
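A short, purely illustrative sketch of this temperature-scaled conversion from root visit counts to search probabilities:

# Sketch: pi_a proportional to N(s, a)^(1/tau); tau -> 0 approaches greedy play.
def search_probabilities(visit_counts, tau=1.0):
    if tau < 1e-3:   # near-zero temperature: play the most-visited move deterministically
        best = max(range(len(visit_counts)), key=lambda a: visit_counts[a])
        return [1.0 if a == best else 0.0 for a in range(len(visit_counts))]
    weights = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]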

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move.
First, the neural network is initialized to random weights θ_0. At each subsequent iteration i ≥ 1, games of self-play are generated (Figure 1a). At each time step t, an MCTS search π_t = α_θi−1(s_t) is executed using the previous iteration of the network, f_θi−1, and a move is played by sampling from the search probabilities π_t. A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward r_T ∈ {-1, +1} (see Methods). The data for each time step t are stored as (s_t, π_t, z_t), where z_t = ±r_T is the game winner from the perspective of the current player at step t. In parallel (Figure 1b), new network parameters θ_i are trained from data (s, π, z) sampled uniformly among all time steps of the last iteration(s) of self-play. The neural network (p, v) = f_θi(s) is adjusted to minimize the error between the predicted value v and the self-play winner z, and to maximize the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums the mean-squared error and the cross-entropy loss:

(p, v) = f_θ(s),   l = (z - v)^2 - π^T log p + c||θ||^2   (1)

where c is a parameter controlling the level of L2 weight regularization (to prevent overfitting).
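A PyTorch sketch of this loss is shown below. The L2 term c||θ||^2 is treated as weight decay in the optimizer, and the hyperparameter values in the usage comment are assumptions, not those used by the authors.

# Sketch of the combined loss in equation (1): value MSE plus policy cross-entropy.
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value, pi, z):
    """l = (z - v)^2 - pi^T log p; the c * ||theta||^2 term is applied separately."""
    value_loss = F.mse_loss(value, z)                                        # (z - v)^2
    policy_loss = -(pi * F.log_softmax(policy_logits, dim=1)).sum(1).mean()  # -pi^T log p
    return value_loss + policy_loss

# Usage sketch (assumed hyperparameters):
# opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# loss = alphazero_loss(logits, v, pi_batch, z_batch); loss.backward(); opt.step()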
AlphaGo Zero's final performance


We then applied our reinforcement learning pipeline to a second instance of AlphaGo Zero, using a larger neural network and a longer training duration. Training again started from completely random play and continued for approximately 40 days.

Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Figure 6a. Games played at regular intervals during training are shown in Extended Data Figure 5 and the Supplementary Information.

We evaluated the fully trained AlphaGo Zero against AlphaGo Fan, AlphaGo Lee, and several previous Go programs in an internal tournament. We also played against the strongest existing program, AlphaGo Master, which is based on the algorithm and architecture presented in this paper (but additionally uses human data and features) and which defeated the strongest human professionals 60-0 in online games. In our evaluation, all programs were allowed 5 seconds of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs, while AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selects the move with maximum probability.

Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead search, achieved an Elo rating of 3,055. By contrast, AlphaGo Zero achieved a rating of 5,185, compared with 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee (which beat Lee Sedol) and 3,144 for AlphaGo Fan (which beat Fan Hui). The evaluation also included the previous Go programs Crazy Stone, Pachi and GnuGo. Each program received 5 seconds of thinking time per move. AlphaGo Zero and AlphaGo Master played on a single machine on Google Cloud; AlphaGo Fan and AlphaGo Lee were distributed over many machines. The raw neural network player of AlphaGo Zero, which directly selects the move a with maximum probability p_a without using MCTS, was also included. Elo ratings were computed so that a 200-point gap corresponds to a 75% probability of winning (ref. 25).
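The stated calibration (a 200-point gap corresponding to roughly a 75% win probability) is close to the standard logistic Elo model. The sketch below uses that standard model as an assumption; it is not necessarily the exact calibration used by the authors.

# Sketch: expected win probability under the standard logistic Elo model (assumed).
def elo_win_probability(rating_a, rating_b):
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(round(elo_win_probability(200, 0), 2))      # roughly 0.76 for a 200-point gap
print(round(elo_win_probability(5185, 4858), 2))  # using the ratings quoted above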
Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100-game match with 2-hour time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Figure 6 and Supplementary Information).

Conclusion

Our results demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging domains: it is possible to train to superhuman level without human examples or guidance, given no knowledge of the domain beyond its basic rules. Furthermore, compared with training on human expert data, the pure reinforcement learning approach requires only a few additional hours of training and achieves much better asymptotic performance. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained extensively from human data with handcrafted features.
Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into this most ancient of games.
