Tanaka: The AlphaGo system plays at a professional level even on a single machine

Tanaka, a researcher in the Facebook AI group, has posted a column with a detailed analysis of the AlphaGo paper published in the journal Nature. In it he argues that the AlphaGo system as a whole plays at a professional level even on a single machine, and that the match with Lee Sedol should be quite exciting.

The following is the original text of Dr. Tanaka's column:

I recently read the AlphaGo paper in Nature carefully; here is some analysis to share with everyone.

The AlphaGo system consists mainly of four parts:

1. The policy network: given the current position, it predicts/samples the next move.

2. The fast rollout: same goal as 1, but roughly 1000 times faster, at some cost in move quality.

3. The value network: given the current position, it estimates whether White or Black will win.

4. Monte Carlo Tree Search (MCTS), which connects the three parts above into a complete system.

Our DarkForest is built from the same four parts. Compared with AlphaGo, DarkForest strengthens part 1 during training, lacks parts 2 and 3, and substitutes the default policy of the open-source program Pachi for the role of part 2. Each part is described below.

  1. The policy network

The policy network takes the current position as input and predicts/samples the next move. Its prediction does not just give the single strongest move; it assigns a score to every possible next move on the board. There are 361 points on the board, so it outputs 361 numbers, with good moves scoring higher than bad ones. DarkForest's innovation in this part is to predict the next three moves rather than a single move during training, which improves the quality of the policy output and matches the performance of the policy network they obtained through self-play with reinforcement learning (the RL network). Of course, they did not use the RL network in the final system; instead they used the network trained directly on human games by supervised learning (the SL network), on the grounds that the moves output by the RL network lack variety, which is bad for search.
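For concreteness, here is a minimal sketch (in PyTorch, and not the actual AlphaGo or DarkForest architecture) of what such a network looks like: a stack of 19x19 feature planes goes in, and one probability per board point, 361 numbers, comes out. The input-plane and filter counts follow the configuration described in the paper; the depth is truncated for brevity.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Minimal sketch of a Go policy network: feature planes in, 361 move probabilities out."""
    def __init__(self, in_planes=48, width=192):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, width, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, kernel_size=1),      # one score per board point
        )

    def forward(self, planes):                       # planes: (batch, 48, 19, 19)
        scores = self.trunk(planes).flatten(1)       # (batch, 361)
        return torch.softmax(scores, dim=1)          # probability for each point
```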

Interestingly, for speed AlphaGo only uses a network 192 filters wide, rather than the best-performing 384-filter-wide network (see Figure 2(a)), so if the GPUs were a bit faster (or more numerous), AlphaGo would certainly get stronger.

The so-called "0.1 seconds per move" mode plays purely with this network, choosing the legal move with the highest confidence. This approach does no search at all, yet its sense of the whole board is very strong: it does not get dragged into local fights, and it is fair to say it models the "feel" for a move. We put DarkForest's policy network directly onto KGS and it already played at 3d, which amazed everyone. One can say that this wave of breakthroughs in Go AI is mainly due to the breakthrough in the policy network. This was unimaginable before: earlier move generators were rule-based, or used local shapes plus a simple linear classifier, and needed years of slow parameter tuning to make any progress.
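That no-search mode then amounts to masking out illegal points and taking the argmax of the network output. A minimal sketch, reusing the hypothetical PolicyNet from the previous snippet (the legal-move mask is an assumed helper, not something the paper specifies):

```python
import torch

def greedy_move(policy_net, planes, legal_mask):
    # planes: (48, 19, 19) input features; legal_mask: 361-element boolean tensor.
    with torch.no_grad():
        probs = policy_net(planes.unsqueeze(0))[0]   # (361,) move probabilities
    probs = probs.masked_fill(~legal_mask, -1.0)     # never pick an illegal point
    index = int(probs.argmax())
    return divmod(index, 19)                         # (row, col) of the chosen move
```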

Of course, using only the policy network also has many problems. On DarkForest we have seen it fight ko pointlessly regardless of the stakes, tenuki for no reason, ignore local life-and-death, and blunder in capturing races, among other things. It is a bit like the offhand moves a strong player makes without serious thought. Because the policy network has no value-judgment capability and plays purely by "intuition", only after search is added does the computer gain the ability to judge value.

  2. The fast rollout

Given the policy network, why still build a fast rollout policy? There are two reasons. First, the policy network is relatively slow: AlphaGo reports 3 milliseconds per evaluation, and ours is similar, while the fast rollout runs at the level of a few microseconds, a difference of roughly 1000x. So it is important that the CPU keeps searching rather than sitting idle while the policy network has not yet returned; once the network returns with better moves, the corresponding move information is updated.

Second, the fast rollout can be used to evaluate a position. Because the number of possible positions is astronomical, search in Go has no hope of reaching the end of the game; at some depth the search has to assign an estimate to the current position. Without a value network, a Go position cannot be evaluated fairly precisely by adding up piece values as in chess; instead the estimate is made by simulated play, rolling out from the current position all the way to the end without considering branching, computing the result, and taking that result as an estimate of the current position's value. There is a trade-off here: with the same time budget, high-quality rollouts give more accurate single estimates but are slower, while fast or even random rollouts give less accurate single estimates but can be run many times and averaged, which may work just as well. So a rollout policy that is both high-quality and fast is very helpful for improving playing strength.
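A minimal sketch of this kind of rollout evaluation; the playout routine is passed in as a parameter, so the speed/quality trade-off discussed above is simply the choice of play_one_rollout and n_rollouts:

```python
def rollout_value(state, play_one_rollout, n_rollouts=100):
    # play_one_rollout(state) plays the position out to the end with some fast
    # (possibly random) policy and returns 1.0 if Black wins, 0.0 otherwise.
    wins = sum(play_one_rollout(state) for _ in range(n_rollouts))
    return wins / n_rollouts   # averaged over many cheap, noisy playouts
```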

To achieve this, a neural network model is simply too slow; one still has to use the traditional method of local pattern matching plus logistic regression. The method is not new, but it works very well: nearly all ad recommendation, sponsored-search ranking, and news ranking systems use it. Compared with the more traditional rule-based approach, once it has absorbed many strong players' games it can tune its parameters automatically by gradient descent, so performance improves faster and with less hassle. AlphaGo used this method to reach a rollout speed of 2 microseconds per move and a move-prediction accuracy of 24.2%. That 24.2% means its top prediction coincides with the move of a strong Go player with probability 0.242; by comparison, the policy network reaches 57% accuracy in 2 milliseconds on a GPU. Here we see the trade-off between rollout speed and accuracy.
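As a hypothetical sketch of such a pattern-based policy (the feature encoding is illustrative, not the paper's): each legal move activates a sparse set of hand-chosen local features, for instance a hash of the 3x3 stone pattern around the move or an "escapes atari" flag; one learned weight per feature scores the move, and a softmax over the scores gives the distribution to sample from. The weights would be fitted by gradient descent on professional games.

```python
import numpy as np

def move_score(feature_ids, weights):
    # Sparse dot product: sum the weights of the features this move activates.
    return sum(weights[f] for f in feature_ids)

def sample_rollout_move(legal_move_features, weights):
    # legal_move_features: one list of active feature ids per legal move.
    scores = np.array([move_score(f, weights) for f in legal_move_features])
    probs = np.exp(scores - scores.max())          # softmax over candidate moves
    probs /= probs.sum()
    return np.random.choice(len(legal_move_features), p=probs)  # index of the chosen move
```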

Unlike training a deep learning model, the fast rollout relies on local feature matching, which naturally requires some Go domain knowledge to choose the local features. Here AlphaGo only gives the number of local features (see Extended Table 4) without describing the features themselves. I recently experimented with their approach and reached 25.1% accuracy at 4-5 microseconds per move, yet the fully integrated system did not reproduce their level. I feel that 24.2% does not fully capture the playing strength of their fast rollout, because a single wrong key move can make the evaluation of a position completely wrong. Figure 2(b) better reflects how accurately their fast rollout estimates positions, and reaching the level of their Figure 2(b) takes much more work than simply matching 24.2%, something they do not emphasize in the paper.

With the fast rollout alone, needing neither the policy network nor the value network, without any help from deep learning or GPUs, and without reinforcement learning, AlphaGo already reaches 3d on a single machine (see the second-to-last row of Extended Table 7), which is quite impressive. Any Go program using traditional methods takes years of work to reach this level on a single machine. Before AlphaGo, Aja Huang had written a very good Go program of his own, so he surely had a lot of accumulated experience in this area.

  3. The value network

AlphaGo's value network can be called the icing on the cake. Judging from Figure 2(b) and Extended Table 7, AlphaGo would not become too weak without it, staying at least at the 7d-8d level. Removing the value network costs about 480 Elo points, whereas removing the policy network costs 800 to 1000 points. Particularly interesting is that using the value network alone to evaluate positions (2177) works less well than using the fast rollout alone (2416); only when the two are combined is there a larger gain. My guess is that the value network and the fast rollout are complementary for position evaluation: at the very start of the game, when both sides play peacefully, the value network matters more, but once there is a complicated life-and-death fight or capturing race, estimating the position by fast rollouts becomes more important. Considering that the value network is the hardest part of the whole system to train (it requires 30 million games of self-play), I guess it was the last piece to be built and the one most likely to improve further.
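Concretely, the paper combines the two evaluations at every search leaf with a simple weighted mix, using a mixing weight of 0.5; in code this is a one-liner:

```python
def leaf_value(value_net_estimate, rollout_result, lam=0.5):
    # Mix the value-network estimate and the rollout outcome for a leaf position,
    # both taken from the same player's point of view (lam = 0.5 in the paper).
    return (1.0 - lam) * value_net_estimate + lam * rollout_result
```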

On the generation of training data for the value network, the details given in the paper's appendix are worth noting. Unlike for the policy network, only one sample is taken from each game for training, to avoid overfitting: otherwise, slightly different inputs from the same game would all have the same output, which is very bad for training. That is why it takes 30 million games, not merely 30 million positions. For each self-play game, the sampling is quite particular: first the SL network is used to ensure diversity of moves, then one random move is played and the resulting position is taken as the sample, and then the more accurate RL network plays to the end of the game to get the most correct estimate of the outcome. How much better this works than using a single network, I cannot say.
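A minimal sketch of that sampling scheme; the helpers for the SL/RL policies, legal-move generation and Chinese-rules scoring, as well as the play/is_over methods on the game state, are all assumed stand-ins rather than names from the paper:

```python
import random

def one_value_sample(new_game, sl_policy, rl_policy, legal_moves, chinese_score,
                     max_moves=450):
    state = new_game()
    branch_at = random.randrange(1, max_moves)              # move index where we branch off
    for _ in range(branch_at - 1):
        state = state.play(sl_policy(state))                # SL network: diverse, human-like opening
    state = state.play(random.choice(legal_moves(state)))   # one uniformly random move
    sample_position = state                                 # the training input for the value net
    while not state.is_over():
        state = state.play(rl_policy(state))                # RL network: strongest play to the end
    return sample_position, chinese_score(state)            # label: final outcome under Chinese rules
```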

One thing that surprised me is that they did no local life-and-death or capturing-race analysis at all; they trained a rather good value network purely by brute-force training. To some extent this shows that deep convolutional networks (DCNNs) have the ability to automatically decompose a problem into sub-problems and solve them separately.

In addition, I suspect that when they generated the training samples they used Chinese rules to decide the final outcome. That is why the March match against Lee Sedol is also required to use Chinese rules; if another rule set were used instead, the value network would need to be retrained (although I expect the difference in the results would not be large). As for why Chinese rules were used from the start, my guess is that they are very convenient to program (I felt the same while writing DarkForest).
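To illustrate why area (Chinese) scoring is so convenient to program: once dead stones are removed, a player's score is simply their stones on the board plus the empty regions bordered only by their stones, with no prisoner bookkeeping at all. A minimal, illustrative scorer (not code from AlphaGo or DarkForest):

```python
def area_score(board, size=19):
    # board: dict {(row, col): 'B' or 'W'}; empty points are absent from the dict.
    score = {'B': 0, 'W': 0}
    for colour in board.values():
        score[colour] += 1                      # stones on the board
    seen = set()
    for p in ((r, c) for r in range(size) for c in range(size)):
        if p in board or p in seen:
            continue
        region, frontier, owners = set(), [p], set()
        while frontier:                          # flood-fill one empty region
            r, c = frontier.pop()
            if (r, c) in region:
                continue
            region.add((r, c))
            for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if not (0 <= q[0] < size and 0 <= q[1] < size):
                    continue
                if q in board:
                    owners.add(board[q])         # stone colour bordering this region
                elif q not in region:
                    frontier.append(q)
        seen |= region
        if len(owners) == 1:                     # territory only if a single colour borders it
            score[owners.pop()] += len(region)
    return score                                 # compare score['B'] against score['W'] + komi
```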

  4. Monte Carlo Tree Search

This part basically uses traditional methods and there is not much to comment on. They use UCT with priors: first consider the moves that the DCNN rates highly, then, as each move accumulates visits, put more trust in the win rate obtained from search. DarkForest, by contrast, directly searches the top 3 or top 5 moves recommended by the DCNN. In my preliminary experiments the results were about the same; of course their approach is more flexible, and when a large number of simulations is allowed, it can find moves the DCNN rates poorly but which are crucial to the position.
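A minimal sketch of this kind of prior-weighted selection: the exploration bonus is proportional to the policy network's prior P and decays as the visit count N grows, so the search gradually shifts its trust from the network's prior to the win rate Q measured by search. The dict-style bookkeeping is illustrative; the exploration constant of 5 matches the value reported in the paper.

```python
import math

def select_move(moves, Q, N, P, c_puct=5.0):
    # Q[m]: mean win rate from search; N[m]: visit count; P[m]: policy-network prior.
    total_visits = sum(N[m] for m in moves)
    def score(m):
        exploration = c_puct * P[m] * math.sqrt(total_visits) / (1 + N[m])
        return Q[m] + exploration
    return max(moves, key=score)
```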

One interesting detail is that when the search reaches a leaf node, the leaf is not expanded immediately; it is only expanded once its visit count reaches a certain threshold (40). This avoids creating too many branches and scattering the search's attention, and also saves precious GPU resources, while at expansion time the value estimate of the leaf position is more accurate. Beyond this, they also use some tricks to prevent multiple threads from all searching the same variation at the start of the search; we noticed this in DarkForest as well and made improvements.
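A minimal sketch of these two tricks with a deliberately simplified node structure: the leaf is expanded only after enough visits, and an in-flight visit is temporarily counted as a loss (a "virtual loss") so that concurrent threads spread out over different variations. The two constants match the numbers reported in the paper; everything else is illustrative.

```python
from dataclasses import dataclass, field

EXPAND_THRESHOLD = 40   # visits before a leaf is expanded (40 in the paper)
VIRTUAL_LOSS = 3        # temporary penalty while a thread is inside a variation

@dataclass
class Node:
    N: int = 0                                    # visit count
    W: float = 0.0                                # accumulated value
    children: dict = field(default_factory=dict)  # move -> Node, empty while a leaf

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

def start_visit(node):
    # Count this in-flight visit as a loss so concurrent threads prefer other branches.
    node.N += VIRTUAL_LOSS
    node.W -= VIRTUAL_LOSS

def finish_visit(node, value, expand_fn):
    # Replace the virtual loss with the real visit and its result.
    node.N += 1 - VIRTUAL_LOSS
    node.W += value + VIRTUAL_LOSS
    # Expand (and spend a GPU evaluation on) the leaf only after enough visits.
    if not node.children and node.N >= EXPAND_THRESHOLD:
        node.children = expand_fn(node)
```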

  5. Summary

Overall, the paper describes a systematic piece of work, not a victory achieved through a breakthrough on one or two small points. Behind the success lies the accumulation of the authors, especially the two first authors David Silver and Aja Huang, over more than five years during their PhDs and after graduation; it is not something that could be accomplished overnight. That they built AlphaGo and now enjoy this honor is well deserved.


In AlphaGo, reinforcement learning plays a smaller role than people imagine. Ideally, we would like an AI system to adapt dynamically to the environment and the opponent's moves during a game and to find a way to counter them, but in AlphaGo reinforcement learning is used mostly to provide better-quality samples for supervised learning to train a better model. Reinforcement learning still has a long way to go in this respect. From the analysis above it can also be seen that, compared with previous Go systems, AlphaGo depends less on Go domain knowledge, but it is still far from being a general system. A professional player can understand an opponent's style and adopt a counter-strategy after watching only a few games, and an experienced gamer can get the hang of a new game after playing it a few times, but so far an AI system still needs training on a huge number of samples to reach human level. One could say that without the games accumulated by countless players over the centuries, there would be no Go AI today.

In addition, the entire AlphaGo system already plays at a professional level on a single machine. If Google is willing to bring up tens of thousands of machines for the match against Lee Sedol (which could not be easier for them; changing a parameter would do), I believe the games will be very exciting.
