Analysis of AlphaGo, Google DeepMind's Go (Weiqi) Program

Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
Author: Yuandong Tian
Link: http://zhuanlan.zhihu.com/yuandong/20607684
Source: Zhihu

Recently I took a close look at the AlphaGo paper published in Nature, and wrote up some analysis to share with everyone.

The AlphaGo system consists mainly of four parts:

1. Policy network: given the current position, it predicts/samples the next move.

2. Fast rollout: same goal as 1, but roughly 1000 times faster, at the cost of some move quality.

3. Value network: given the current position, it estimates how likely White or Black is to win.

4. Monte Carlo tree search (MCTS): connects the three parts above into a complete system.

Our darkforest is built from the same four parts. Compared with AlphaGo, darkforest strengthens 1 during training, lacks 2 and 3, and substitutes the default policy of the open-source program Pachi for the role of 2. Each part is described below.

1. Policy network

The policy network takes the current position as input and predicts/samples the next move. Its prediction does not just give the single strongest move; it assigns a score to every possible next point on the board. There are 361 points on the board, so it outputs 361 numbers, with good moves scoring higher than bad ones. Darkforest innovates in this part: by predicting the next three moves rather than one during training, it improves the quality of the policy output, matching the effect of the policy network AlphaGo obtained through reinforcement learning (the RL network). Of course, they did not use the reinforcement-learned network in the final system; instead they used the network trained directly by supervised learning (the SL network), on the grounds that the RL network's moves lack variety, which hurts the search.
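As a rough illustration of what "one score for each of the 361 points" means, here is a minimal policy-network sketch in PyTorch. This is an assumption-laden toy, not AlphaGo's actual architecture (the paper's policy network is about 13 convolutional layers with 192 filters and takes many hand-crafted input planes); the layer sizes and the single input plane below are placeholders.

```python
import torch
import torch.nn as nn

class TinyPolicyNet(nn.Module):
    """Toy policy network: board position in, one score per board point out.
    A sketch only; AlphaGo's real network is far deeper and uses many
    hand-crafted input feature planes."""

    def __init__(self, in_planes=1, width=32, board=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(width, 1, kernel_size=1)  # collapse to one score plane
        self.board = board

    def forward(self, x):
        # x: (batch, in_planes, 19, 19) -> (batch, 361) move logits
        return self.head(self.trunk(x)).view(x.size(0), self.board * self.board)

# Score all 361 points for a single (empty) position.
net = TinyPolicyNet()
probs = torch.softmax(net(torch.zeros(1, 1, 19, 19)), dim=1)  # probabilities over 361 moves
```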

Interestingly, for speed AlphaGo only uses a 192-filter policy network, not the best-performing 384-filter one (see Figure 2(a)), so if GPUs get a bit faster (or more plentiful), AlphaGo will certainly get stronger.

The much-quoted "one move in 0.1 seconds" mode is purely this network playing the legal move it is most confident in. This involves no search at all, yet its whole-board judgment is very strong and it does not get dragged into local fights; saying that it models "intuition" for the game is no exaggeration. We put darkforest's policy network directly on KGS and it reached 3d, which amazed everyone. It is fair to say that this wave of Go AI breakthroughs comes mainly from the breakthrough in the policy network. That was unthinkable before: older move generators were rule-based, or trained on local shapes with a simple linear classifier, and needed years of slow parameter tuning before they improved.
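In code, this greedy no-search mode is just an argmax over the legal moves; a minimal sketch (the `probs` array is the 361 policy outputs, and `legal_mask` is a hypothetical legality mask, not names from the paper):

```python
import numpy as np

def greedy_move(probs, legal_mask):
    """Play the policy network's highest-confidence legal move, with no search."""
    masked = np.where(legal_mask, probs, -np.inf)  # rule out illegal points
    return int(masked.argmax())                    # board index 0..360
```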

Of course, using only the policy network has plenty of problems. From what we see in darkforest, it fights ko regardless of the stakes, plays tenuki for no reason, ignores local life and death, makes mistakes in capturing races, and so on. It plays somewhat like a strong player's offhand moves made without serious thought. The policy network has no value judgment and plays purely by "intuition"; only after search is added does the computer gain the ability to judge value.

2. Fast rollout

Given the policy network, why still build a fast rollout policy? There are two reasons. First, the policy network is relatively slow to run: AlphaGo reports 3 milliseconds per evaluation, and ours is similar, whereas the fast rollout runs in a few microseconds, roughly 1000 times faster. So it is important to keep the CPU searching instead of sitting idle while the policy network has not yet returned, and then to update the corresponding move information once the network returns better moves.

Second, the fast rollout can be used to evaluate a position. Because the number of possible positions is astronomical, a Go search has no hope of reading to the end of the game; past a certain depth it must evaluate the current position. Without a value network, and unlike chess, where counting material gives a fairly accurate evaluation, a Go position has to be estimated by simulated playouts: play from the current position all the way to the end, without considering branches, compute the result, and take that result as an estimate of the current position's value. There is a trade-off: in the same amount of time, a carefully computed playout is accurate per sample but slow, while a fast or even random playout is inaccurate per sample but can be repeated many times and averaged, which may work just as well. So a playout policy that is both high quality and fast is a great help to playing strength.
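A minimal sketch of this playout-based evaluation follows. The game helpers (`legal_moves`, `play`, `is_over`, `score`) are hypothetical placeholders for real Go logic, and the win-rate bookkeeping is simplified to one side's point of view.

```python
import random

def random_policy(pos):
    """Cheapest possible playout policy: pick any legal move."""
    return random.choice(legal_moves(pos))

def rollout_value(position, policy=random_policy, n_playouts=100):
    """Estimate a position by playing it out to the end many times and averaging.
    Single playouts are noisy, but the average over many cheap playouts can
    still be a usable value estimate."""
    wins = 0
    for _ in range(n_playouts):
        pos = position.copy()
        while not is_over(pos):
            pos = play(pos, policy(pos))    # straight to the end, no branching
        wins += 1 if score(pos) > 0 else 0  # +1 if the side we care about won
    return wins / n_playouts
```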

To get a playout policy that is both fast and reasonably good, a neural network model is too slow, so the traditional method of local pattern matching plus logistic regression is still used. It is not new, but it works very well; almost all ad recommendation, paid-search ranking, and news ranking systems use it. Compared with even more traditional rule-based schemes, once it has absorbed a large number of expert games it can tune its parameters automatically by gradient descent, so it improves faster and with less fuss. With this method AlphaGo reaches a rollout speed of 2 microseconds per move at a move-prediction accuracy of 24.2%. The 24.2% means its top prediction coincides with the expert's actual move with probability 0.242; by comparison, the policy network reaches 57% accuracy in 2 milliseconds on a GPU. Here we see the trade-off between rollout speed and accuracy.
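As a concrete, if very simplified, picture of such a rollout policy: score each candidate move with a single learned weight per local pattern and take a softmax. The 3x3-pattern hashing below is my own placeholder, not AlphaGo's feature set, which the paper describes only by feature counts.

```python
import numpy as np

N_FEATURES = 100_000  # size of the (hashed) pattern table; placeholder value

def pattern_id(board, move):
    """Hash the 3x3 neighbourhood around `move` into a feature id.
    A stand-in for real Go pattern/tactical features."""
    r, c = divmod(move, 19)
    patch = tuple(board[max(0, r - 1):r + 2, max(0, c - 1):c + 2].ravel())
    return hash(patch) % N_FEATURES

def rollout_probs(board, legal, weights):
    """Softmax (logistic-regression-style) distribution over the legal moves."""
    scores = np.array([weights[pattern_id(board, m)] for m in legal])
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```

The weight table would be fitted by gradient descent on expert moves, which is exactly the "automatic parameter tuning" advantage over rule-based schemes mentioned above.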

Unlike training a deep learning model, the fast rollout relies on local pattern matching, which naturally requires some Go domain knowledge to choose the local features. AlphaGo only reports the number of local features (see Extended Table 4) without specifying what the features are. I recently experimented with their approach and reached 25.1% accuracy at 4-5 microseconds per move, but the full system did not reproduce their level. My feeling is that 24.2% does not fully capture the strength of their fast rollout, because a single wrong key move makes the resulting evaluation of the position completely wrong. Figure 2(b) better reflects how accurately their fast rollout estimates positions, and reaching the level of their Figure 2(b) requires much more work than simply matching the 24.2%, which they did not emphasize in the paper.

With the fast rollout alone, without the policy network or the value network, without any deep learning or GPUs, and without reinforcement learning, AlphaGo already reaches 3d on a single machine (see the second-to-last row of Extended Table 7), which is quite impressive. Any Go program that reaches this level on a single machine with traditional methods takes years of work. Before AlphaGo, Aja Huang had written a very good Go program of his own, so he presumably had a lot of accumulated experience here.

3. Value network

AlphaGo's value network can be described as icing on the cake. Judging from Figure 2(b) and Extended Table 7, AlphaGo would not be too much weaker without it, probably still around 7d-8d. Removing the value network costs about 480 Elo, whereas removing the policy network costs 800 to 1000. Particularly interesting is that evaluating positions with the value network alone (2177) is less effective than using the fast rollout alone (2416); only combining the two gives the larger improvement. My guess is that the value network and the fast rollout are complementary: at the start of a game, when play is relatively peaceful, the value network matters more, but in complicated life-and-death situations or capturing races, estimating the position by fast rollouts becomes more important. Given that the value network is the hardest part of the whole system to train (it needs 30 million self-play games), I guess it was the last part to be built and the most likely to improve further.

Regarding how the value network's training data is generated, the fine print in the paper's appendix is worth noting. Unlike the policy network, only one position per game is taken for training, to avoid overfitting; otherwise positions from the same game would have slightly different inputs but identical outputs, which is very bad for training. That is why 30 million games are needed, not just 30 million positions. For each game, the sampling is quite careful: first the SL network is used to keep the moves diverse, then one random move is played and that position is taken as the sample, and then the more accurate RL network plays to the end to obtain the most reliable win/loss estimate. Of course, how much better this is than using a single network, I cannot say.
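A sketch of that sampling recipe as I read the appendix; the helpers (`new_game`, `play`, `is_over`, `score`, `sl_move`, `rl_move`, `random_legal_move`) are hypothetical stand-ins for the two networks and the game logic.

```python
import random

def one_value_sample(max_moves=450):
    """Produce one (position, outcome) training pair from one self-play game."""
    pos = new_game()
    u = random.randint(1, max_moves)          # random cut-off step U
    for _ in range(u - 1):
        pos = play(pos, sl_move(pos))         # SL-network moves keep the games diverse
    pos = play(pos, random_legal_move(pos))   # one uniformly random move at step U
    sample = pos.copy()                       # this single position is the training input
    while not is_over(pos):
        pos = play(pos, rl_move(pos))         # the stronger RL network finishes the game
    return sample, score(pos)                 # the final result is the training label
```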

One thing that surprised me is that they did no local life-and-death or capturing-race analysis at all; they trained a fairly good value network purely by brute force. To some extent this shows that deep convolutional networks (DCNNs) can automatically decompose a problem into sub-problems and solve them separately.

In addition, I guess that when they generated the training samples, the final result was decided under Chinese rules. That is why the March match against Lee Sedol is also required to use Chinese rules; if other rules were used instead, the value network would need to be retrained (although I doubt the difference would be large). As for why Chinese rules were used from the start, my guess is that they are very convenient to program (I felt the same when writing darkforest).

4. Monte Carlo tree search

This part mostly uses traditional methods and there is not much to comment on. They use UCT with a prior: the moves the DCNN considers good are tried first, and once each move has been explored enough times, the win rate obtained from exploration is trusted more. Darkforest instead directly searches only the top 3 or top 5 moves recommended by the DCNN. My initial experiments gave similar results; of course their approach is more flexible, and when a large number of search iterations is allowed, it can find moves that the DCNN thinks are bad but that are crucial to the position.
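Concretely, "UCT with a prior" means each child is chosen to maximise its empirical win rate plus an exploration bonus that is proportional to the DCNN prior and shrinks with visits. A sketch follows; the constant `c_puct` and the node field names are my assumptions, not the paper's exact settings.

```python
import math

def select_child(children, c_puct=5.0):
    """Prior-weighted UCT: argmax of Q(a) + c * P(a) * sqrt(N_parent) / (1 + N(a)).
    With few visits the DCNN prior P dominates; as visits accumulate, the
    win rate Q estimated from exploration is trusted more."""
    n_parent = sum(ch.visits for ch in children)

    def puct(ch):
        q = ch.wins / ch.visits if ch.visits else 0.0
        u = c_puct * ch.prior * math.sqrt(n_parent) / (1 + ch.visits)
        return q + u

    return max(children, key=puct)
```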

One interesting detail is that when the search reaches a leaf node, the leaf is not expanded immediately; it is only expanded once its visit count reaches a threshold (40). This avoids creating too many branches that scatter the search's attention, saves precious GPU resources, and also gives a more accurate value estimate of the leaf position by the time it is expanded. In addition, they use some tricks to prevent multiple threads from searching the same variation at the start of the search, something we also noticed and improved on in darkforest.
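A sketch of the delayed expansion; the threshold of 40 visits is the value quoted above, while the node fields and helper names here are assumed.

```python
EXPAND_THRESHOLD = 40  # visits a leaf must accumulate before it is expanded

def evaluate_leaf(leaf, policy_net, rollout_value):
    """Expand a leaf only once the search keeps coming back to it.
    Until then it is evaluated by cheap rollouts, so the expensive GPU call
    to the policy network (for child priors) is spent only on popular leaves."""
    leaf.visits += 1
    if not leaf.expanded and leaf.visits >= EXPAND_THRESHOLD:
        leaf.child_priors = policy_net(leaf.position)  # one amortised GPU call
        leaf.expanded = True
    return rollout_value(leaf.position)                # value to back up the tree
```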

5. Summary

Overall, this paper describes a systematic effort, not a victory reached by a breakthrough in one or two small points. Behind the success lies the accumulated work of the authors, especially the two first authors David Silver and Aja Huang, during their PhDs and over more than five years since; it could not have been done overnight. They richly deserve the honor of having built AlphaGo.

The analysis above also shows that, compared with earlier Go systems, AlphaGo depends less on Go-specific domain knowledge, but it is still far from being a general-purpose system. A professional player can understand an opponent's style after only a few games and adopt a suitable strategy, and an experienced gamer can get the hang of a new game after playing it a few times, but so far an AI system still needs training on an enormous number of samples to reach human level. One can say that without the games accumulated over thousands of years by countless players, there would be no Go AI today.

In AlphaGo, reinforcement learning plays a smaller role than one might imagine. Ideally, we would like an AI system to adapt dynamically to the environment and the opponent's moves during a game and find ways to counter them, but in AlphaGo reinforcement learning is mostly used to generate more, better-quality samples for supervised learning to train a better model. Reinforcement learning still has a long way to go in this respect.

In addition, according to their paper, the full AlphaGo system already plays at a professional level on a single machine. If Google is willing to spin up tens of thousands of machines for the match against Lee Sedol (easy for them, just change a parameter), I believe the games will be very exciting.

===========================

Some updates.

Question 1: "When AlphaGo's MCTS does rollouts, besides using the fast rollout policy it also uses the already-built parts of the search tree, which looks like AMAF/RAVE in reverse: AMAF propagates information from the rollouts into other, unrelated parts of the tree, whereas AlphaGo uses other, unrelated parts of the tree to strengthen the rollouts. I wonder whether this is one of the reasons it is stronger than other DCNN+MCTS programs."

This technique has appeared in papers on solving life-and-death (tsumego) problems; it improves search efficiency to some extent, but by how much is unknown.

Question 2: "Improving the move quality of the rollout policy may actually cause playing strength to drop."

There are two cases here: the tree policy and the default policy. The AlphaGo paper already notes that the tree policy's distribution must not be too sharp, otherwise the search pays too much attention to a few moves that merely look good, which can weaken play. Apart from that, though, a better tree policy generally makes the program stronger.
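To make "too sharp" concrete: the shape of a move distribution can be controlled by a softmax temperature, and a very low temperature piles nearly all probability onto one or two moves, so the search stops considering alternatives. A toy illustration follows; the numbers and the temperature values are made up, not AlphaGo's settings.

```python
import numpy as np

def reshape(probs, temperature):
    """Re-sharpen or flatten a move distribution; lower temperature -> sharper."""
    logits = np.log(np.maximum(probs, 1e-12)) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

p = np.array([0.50, 0.30, 0.15, 0.05])
print(reshape(p, 0.25))  # very sharp: almost all mass on the top move
print(reshape(p, 2.00))  # flatter: alternatives keep getting explored
```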

For the default policy, that is, (semi-)random play to the end of the game followed by scoring, things are much more complicated: better move quality does not necessarily give a more accurate assessment of the position. The default policy mainly needs to get the life-and-death status of each group roughly right, neither playing dead groups into life nor the reverse; its demands on whole-board judgment are much lower. The two sides can perfectly well cooperate to finish playing out one group and then move on to another, rather than always having to play the point the opponent hates most.


From: http://zhuanlan.zhihu.com/yuandong/20607684
