The following is an online paper:
"Thinking" of computer players"
Li Cong licong@hotmail.com
Keywords: games, search, learning
Abstract: This article introduces the principles and current state of computer game playing, and selectively presents some of the latest search and learning techniques.
1. Games and Computer Science
Readers may have heard from media reports a few years ago about the match between IBM's supercomputer Deep Blue and Kasparov, the strongest chess player in the world; readers may also have played Chinese chess, chess, or Go against programs on their own computers. How do computer players think? Are they as intelligent as humans?
In my opinion, many popular claims on this subject are extremely misleading.
I remember a story I heard as a child: a computer player and a master were playing against each other, and when the position turned against it and the game was about to be lost, the computer player simply crashed on the master. Of course, there is no way to verify the facts or the source of this story, but it owes less to distortion in transmission than to the rich imagination of rumor makers. If computer players were that smart, I am afraid no human would be a match for them.
Another amusing claim says that when playing against a computer, you can leave it at a loss simply by making unconventional moves that do not follow standard theory. Is that true? The reader will be able to answer this after the later sections.
Claims that the supercomputer Deep Blue could predict the champion of a football competition have reached the level of pure sensationalism. Deep Blue plays chess well, but its chess-playing ability gives it absolutely no power to predict the results of football matches.
The games discussed here are contests like chess. In the narrow sense the Chinese word for game covers both gambling and board games; gambling is nothing to advocate, so here "game" refers solely to board games. In such a game, if one side wins, the other loses; in some games the two sides can also reach a draw. In short, whatever one side gains at any moment of a game, the other side loses. That is to say, there is no "win-win" situation. Problems of this kind are called zero-sum games, because the payoffs of the two sides always sum to 0.
Game playing was implemented on computers barely ten years after the computer itself was born. When people first tried to use computers to simulate human intelligence, games became a classic test problem. At that time A. L. Samuel, one of the founders of AI, wrote a program for playing checkers. In 1959 the program beat its designer, and in 1962 it beat a U.S. state champion.
There are all kinds of computer players in the world today. Chess programs that run on ordinary computers have reached the level of professional players. At Othello (Reversi), humans are no longer a match for computers. In Go, however, computers still have a very long way to go before they can compete with people.
If a computer player beats a human player, can we say that it is as intelligent as the human? I will come back to this question at the end. First, let me show how computer players actually play.
2. Choosing the Best Move
Put simply, the principle by which a computer player plays is to choose the move that it judges most advantageous to itself.
            A( )
           /  |  \
       B(4)  C(3)  D(-1)
Look at the figure above. Assume that at some moment it is the computer player's turn in situation A. The computer player of course knows the rules, so it finds that there are exactly three legal moves (only three for ease of drawing; in fact there are a few dozen legal moves in chess and Chinese chess, and as many as 200 or 300 in Go). These three moves lead to situations B, C, and D respectively. Now assume the computer player can judge how favorable a situation is and compute a score for each: say B scores 4, C scores 3, and D scores -1. We adopt the convention that the higher the score, the better the situation for the computer; a score of 0 means the two sides are level, a positive score means the computer has the advantage, and a negative score means it is at a disadvantage. The computer player therefore chooses the move most favorable to it, the first one, finally producing situation B. As a by-product, situation A itself can be scored 4: the other two moves are pointless for the computer, so the score of A is simply the score of B.
The next question is how a computer can compute a score for a situation. In game playing this is called position evaluation (situation judgment).
Let us take Chinese chess or chess as an example (for Go the problem is far more complicated). The simplest evaluation compares the material of the two sides on the board. For example, in Chinese chess, score each chariot 8 points, each horse or cannon 3.5 points, and each soldier, advisor, or elephant 1 point; the general is priceless, so give him, say, 1000 points. It is then easy to total up each side's strength, and the difference between the two totals forms the simplest evaluation. Of course, such an evaluation is very rough.
A slightly more sophisticated evaluation considers positional advantages in addition to material. For example, one side's horse may occupy a very active post while the other side's horse still stands on its original square; there is a clear gap between the two, and such gaps must be converted into scores and added to the evaluation. This is the most common form of evaluation used in computer game programs.
In general, an evaluation can be written as a mathematical formula. Here is a simple example.
Suppose n factors (or features) of a situation are considered. Let xi indicate whether the i-th factor appears in the current situation G: xi = 1 if it does, xi = 0 if it does not. For example, suppose the 6th factor is "one side's first chariot"; when that chariot is on the board, x6 = 1, and when it is gone, x6 = 0. Let wi be the score the i-th factor contributes to the evaluation; this is called its weight. For example, if a chariot is worth 8 points, then w6 = 8. The evaluation of situation G is then:
f(G) = w1·x1 + w2·x2 + … + wn·xn = Σi wi·xi
The symbol Σ is called the summation operator. If the reader has not seen this symbol before, just look at the equation above and think of it as repeated addition inside a for loop; it is merely a notation introduced for convenience.
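To make this concrete, here is a minimal C sketch of such a linear evaluation function. The feature count N, the arrays w and x, and the function name are all illustrative; a real program would fill x by examining the board.

#define N 64                          /* number of features considered */

double evaluate(const double w[N], const double x[N])
{
    double f = 0.0;
    for (int i = 0; i < N; i++)
        f += w[i] * x[i];             /* f(G) = w1*x1 + ... + wn*xn */
    return f;
}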
3. The Magic of Search
If evaluations were perfectly accurate, the story would simply end here. Unfortunately the problem is not that simple: we (computers and humans alike) can never obtain a perfectly accurate evaluation. The reason is that an evaluation is static, while the game is dynamic, and it is hard to capture dynamic information in static features. Of course, to be safe one could try to include every factor that might influence the situation, but the number of such factors is astronomical: we could neither assign a weight to each of them nor compute such an evaluation within a limited time.
As everyone knows, people cannot accurately judge a position statically at a glance when they play; in general they calculate several moves ahead and look at the situations that result. As the saying goes, "the one who calculates more, wins": the deeper one looks ahead, the more likely one is to win. This way of thinking was soon carried over to the computer.
3.1 The Minimax Tree
Take the figure in Part 2 as an example, and look one move further ahead:
                  A( )
               /    |    \
           B( )    C( )    D( )
           /  \    /  \    /  \
       E(5) F(3) G(2.5) H(2) I(-2) J(10)
This is a tree. From each of the three situations B, C, and D there are exactly two legal moves, leading to situations E, F, G, H, I, and J, whose evaluations are already known. The computer player's task is to use this known information to choose whichever of B, C, and D is most favorable to it. Situation J has the largest score, so should the computer choose situation D, which can lead to J?
To keep things straight, assume the computer player's opponent (a person, or perhaps another program) is facing situation B; he will certainly choose the move best for himself. Because this is a zero-sum game, the move most favorable to the opponent is the least favorable to the computer, so between E and F he will choose F. If the computer chooses B, the opponent will answer with F, and the computer finally gets 3; that is, the evaluation of B should be f(B) = f(F) = 3. Similarly, at C the opponent will choose H, so f(C) = f(H) = 2, and likewise f(D) = f(I) = -2. B now has the highest score, so the computer chooses B, and f(A) = f(B) = 3.
By the same reasoning, if the computer chose situation D because f(J) = 10, the opponent would reply with I, and the computer would end up with the worst result, -2.
We can see that at a level where the computer moves, it always picks the highest-scoring situation among those one level below, while at a level where the opponent moves, the lowest-scoring one is always chosen. In terms of data structures: suppose the scores of all situations reachable after a certain number of moves have been computed; then, working upward from the bottom of the tree, each node at an opponent-to-move level takes the minimum of its children's values, and each node at a computer-to-move level takes the maximum. This structure is called the minimax tree.
The problem thus becomes traversing the minimax tree rooted at a given situation. The traversal can be done with a depth-first search. During the search, each node on the current path stores the best value found for it so far. The search is implemented as a recursive function that computes a node's optimal value: when the node receives a child's return value, it keeps the better of the current best value and the returned value as the new current best. When all children of a node have been searched, its current best value is its actual optimal value, i.e. the node's score, which is then returned to its parent. The minimax search pseudo code is given below:
double Minimax(int depth, Position p)
{   /* compute the score of situation p */
    int i;
    double f, t;
    if (depth == 0)
        return Evaluate(p);          /* leaf node */
    generate the successors of p: p_1, ..., p_w;
    if (the computer is to move)
        f = -1000;
    else
        f = 1000;                    /* initialize the current best value;
                                        1000 is assumed to be an unreachable score */
    for (i = 1; i <= w; i++)
    {
        t = Minimax(depth - 1, p_i);
        if (t > f && the computer is to move)
            f = t;
        if (t < f && the opponent is to move)
            f = t;
    }
    return f;
}
Note that in the search we care not only about each situation's score but also about how that score is achieved, i.e. which move the computer player or the opponent must choose to obtain it. That bookkeeping is not included in the pseudo code above, but a reader who understands the pseudo code will be able to add the corresponding code without difficulty.
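For illustration, here is one hedged sketch of that bookkeeping at the root, in C. Everything here is hypothetical scaffolding rather than part of the pseudo code above: Position is an opaque board type, and num_moves, make_move, and Minimax are assumed interfaces.

typedef struct Position Position;    /* opaque board type, assumed */
typedef int Move;

/* Assumed (hypothetical) interfaces: */
int       num_moves(const Position *p);
Position *make_move(const Position *p, Move m);
double    Minimax(int depth, const Position *p);

Move choose_move(const Position *p, int depth)
{
    double best = -1000.0;           /* the same "impossible" sentinel as above */
    Move best_move = 0;
    int w = num_moves(p);            /* successors p_1, ..., p_w */
    for (Move i = 1; i <= w; i++)
    {
        double t = Minimax(depth - 1, make_move(p, i));
        if (t > best)                /* the root is a computer (max) node */
        {
            best = t;
            best_move = i;           /* remember the move, not just the score */
        }
    }
    return best_move;
}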
Furthermore, for a zero-sum game, the evaluation function seen from the opponent's point of view is simply g = -f. In the depth-first search, a computer-to-move node wants the child with the largest f, while an opponent-to-move node wants the child with the smallest f, i.e. the largest g. In this way the two cases are unified: every node takes a maximum, and the returned value merely changes sign to convert between f and g. In the earlier example, at the opponent-to-move node B we have g(E) = -f(E) = -5 and g(F) = -f(F) = -3; B takes the maximum, so g(B) = -3; returning to the computer-to-move node A, the sign flips, giving f(B) = -g(B) = 3, and at A the maximum of the f values is taken. This is called negamax search, and it is simpler than minimax search. The negamax pseudo code follows; I trust the reader will understand it at a glance.
double Negamax(int depth, Position p)
{   /* compute the score of situation p, seen from the side to move */
    int i;
    double f, t;
    if (depth == 0)
        return Evaluate(p);          /* leaf node */
    generate the successors of p: p_1, ..., p_w;
    f = -1000;                       /* initialize the current best value;
                                        1000 is assumed to be an unreachable score */
    for (i = 1; i <= w; i++)
    {
        t = -Negamax(depth - 1, p_i);
        if (t > f)
            f = t;
    }
    return f;
}
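The small tree of section 3.1 makes a convenient test. The following self-contained C program hard-codes that tree and runs negamax on it; it prints f(A) = 3, the value derived in the text. Hard-coding the tree is purely for checking; a real program generates successors from the rules of the game.

#include <stdio.h>

typedef struct Node {
    double value;                    /* used only at the leaves */
    int nchild;
    const struct Node *child[3];
} Node;

static double negamax(const Node *p, int depth)
{
    if (depth == 0 || p->nchild == 0)
        return p->value;             /* leaf: static evaluation */
    double f = -1000.0;              /* assumed unreachable value */
    for (int i = 0; i < p->nchild; i++)
    {
        double t = -negamax(p->child[i], depth - 1);
        if (t > f)
            f = t;
    }
    return f;
}

int main(void)
{
    /* Leaves E(5), F(3), G(2.5), H(2), I(-2), J(10), scored for the computer,
       which is to move at E..J after the opponent moved at B, C, or D. */
    Node E = {5, 0, {0}}, F = {3, 0, {0}}, G = {2.5, 0, {0}};
    Node H = {2, 0, {0}}, I = {-2, 0, {0}}, J = {10, 0, {0}};
    Node B = {0, 2, {&E, &F}}, C = {0, 2, {&G, &H}}, D = {0, 2, {&I, &J}};
    Node A = {0, 3, {&B, &C, &D}};
    printf("f(A) = %g\n", negamax(&A, 2));   /* prints 3, matching the text */
    return 0;
}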
3.2 α-β Pruning
Now a new problem arises. Suppose we are playing Chinese chess or chess and want to search five full rounds each time; five rounds means ten moves by the two sides together. Each move offers a few dozen choices; even at only 30 choices per move, the minimax tree has 30^10 leaf nodes, an enormous number. The more nodes searched, the more time spent, and nobody can wait years for a single move. The simplest way to search fewer nodes is to reduce the search depth, but that seriously weakens the computer's play. Is there a way to reduce the number of nodes searched without reducing the depth?
           A(3)
            |
           B(2)
          / | \
      C(4) E( ) F( )
Assume the figure shows a local portion of the tree during a depth-first search, and write fnow(·) for a node's current best value. B is a computer-to-move node; A and C are opponent-to-move nodes. The subtree under node C has just been fully searched and its value returned from C to B; the numbers in parentheses are the current best values found so far, i.e. fnow(A) = 3, fnow(B) = 2, fnow(C) = f(C) = 4. Since B is a computer-to-move node and f(C) > fnow(B), B's current best value is updated: fnow(B) = f(C) = 4. Nodes E and F have not been searched yet, but B takes the maximum of its children, and its current best value can only rise as the search proceeds, never fall; hence f(B) >= fnow(B) = 4. Node A, however, is an opponent-to-move node and takes the minimum of its children, and since f(B) >= 4 > 3 = fnow(A), the opponent at A already has a better option than B and will never choose the move leading to B. That is, B cannot affect A's optimal value, so B can return to its parent A at once, without searching nodes E and F.
During the search, the current best value of a computer-to-move node is called its alpha value (the maximum so far, like B in the example above), and the current best value of an opponent-to-move node is called its beta value (the minimum so far, like A in the example). As the search proceeds, alpha values rise and beta values fall; together they bound an interval called the window, into which a node's final optimal value must fall. Whenever a child's return value obtained at a computer-to-move node exceeds the relevant beta value, a cutoff (pruning) occurs.
Like minimax, α-β pruning has a more concise negamax version: a node's alpha value is the negative of its parent's beta value, and its beta value is the negative of the parent's alpha value. The α-β pseudo code is as follows:
double AlphaBeta(int depth, double alpha, double beta, Position p)
{   /* compute the optimal value of situation p */
    int i;
    double t;
    if (depth == 0)
        return Evaluate(p);          /* leaf node */
    generate the successors of p: p_1, ..., p_w;
    for (i = 1; i <= w; i++)
    {
        t = -AlphaBeta(depth - 1, -beta, -alpha, p_i);
        if (t > alpha)
        {
            if (t > beta)
                return t;            /* cutoff: the opponent will avoid p */
            alpha = t;               /* a better move has been found */
        }
    }
    return alpha;
}
3.3 NegaScout Search
Normally, at the root of the search tree, where the computer player must actually decide on a move, we call the AlphaBeta function of 3.2 to obtain the root score and, at the same time, the move to play. When calling it we set alpha = -1000 and beta = 1000, where 1000 is a sufficiently large number: at the root the window must be initialized as wide as possible. During the search, as information accumulates, the window gradually shrinks and finally converges to the root's optimal value.
Now consider an interesting question: what happens if we call AlphaBeta(d, b - 0.1, b, P)? Here 0.1 is a small constant and b is some fixed value.
Reason directly from the program, where the formal parameter beta equals b. If the function's return value is greater than b (beta), then for situation P at least one child returned a value larger than b, so P's optimal value must be greater than b; in that case b is a lower bound on P's optimal value. Note that a cutoff may have removed some children before they could be searched, so the return value itself need not be P's actual optimal value; it may be smaller than the actual optimal value. If the return value is less than b, then no child's optimal value exceeds b, so P's optimal value must be less than b (in fact, since the tiny initial window is passed down through the recursion, the returned values only carry information about whether b is exceeded, so here too the return value is not P's actual optimal value). In this case b is an upper bound on P's optimal value. We conclude: with a call of this kind, we can decide whether P's optimal value is larger than b by checking whether the return value is larger than b.
Notice that in such a call alpha and beta are very close together, so the window is very small. With a very small window, at every level of the search tree the return value is far more likely to exceed beta, so cutoffs become much more frequent.
A call of this kind is called a zero-window α-β search. Because of the many cutoffs, it takes much less time than a regular α-β search. On the other hand, a regular α-β search obtains a situation's actual optimal value, while a zero-window search only tells us how the actual optimal value compares with a given value b, i.e. it yields an upper or lower bound.
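As a sketch, the zero-window probe can be wrapped in a tiny helper. This assumes the AlphaBeta routine of section 3.2 (taking the position by pointer here); the helper name and the EPS constant, standing in for the 0.1 of the text, are illustrative.

#define EPS 0.1

typedef struct Position Position;    /* opaque board type, assumed */
double AlphaBeta(int depth, double alpha, double beta, Position *p);

/* Returns nonzero iff the optimal value of p is greater than b. */
int value_exceeds(int depth, double b, Position *p)
{
    double t = AlphaBeta(depth, b - EPS, b, p);   /* zero window (b-0.1, b) */
    return t > b;     /* > b: b is a lower bound; otherwise b is an upper bound */
}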
This suggests an improvement to α-β pruning: before each full search of a child, first use a zero-window call to test whether the child can improve the current best value, i.e. whether its actual optimal value exceeds alpha. If not, skip the child. Further, when the zero-window call returns a value greater than alpha, we know from the analysis above that the child's actual optimal value is at least that value; so if the return value already exceeds beta, the actual optimal value also exceeds beta, and we can cut off immediately. This is the NegaScout search. The extra cost of the method is the zero-window calls; but whenever a zero-window call returns a value no greater than alpha, the full-window call it replaces is saved, and when it returns a value greater than beta, a cutoff is produced directly. Practice shows that the zero-window calls produce so many cutoffs that their cost is tiny compared with full-window α-β calls, and on the whole NegaScout is more efficient than plain α-β pruning. Below is a (not too rigorous) pseudo code of the NegaScout algorithm, in negamax form, following the ideas above:
double NegaScout(int depth, double alpha, double beta, Position p)
{   /* compute the optimal value of situation p */
    int i;
    double t;
    if (depth == 0)
        return Evaluate(p);          /* leaf node */
    generate the successors of p: p_1, ..., p_w;
    for (i = 1; i <= w; i++)
    {
        t = -NegaScout(depth - 1, -alpha - 0.1, -alpha, p_i);  /* zero-window test */
        if (t > alpha)
        {
            if (t < beta)
            {   /* the child may improve alpha: re-search with the full window */
                t = -NegaScout(depth - 1, -beta, -t, p_i);
                if (t > alpha)
                {
                    if (t > beta)
                        return t;    /* cutoff */
                    alpha = t;
                }
            }
            else
                return t;            /* t >= beta: cutoff without re-searching */
        }
    }
    return alpha;
}
For details of NegaScout and its further refinements, see [1].
3.4 Magic
From the story so far, the development of search technology may seem essentially complete. In fact, this is only the beginning.
First there are incremental improvements. For example, the searches discussed so far consume no memory beyond the stack holding the current search path (one branch of the tree), which fails to exploit available resources. A real game-playing program keeps a hash table (a transposition table) recording the situations met earlier in the search. Whenever the same situation is met again during the search, which is very common (for example, different move orders often lead to the same position), the earlier result can be reused. See [2] for details.
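A minimal sketch of such a table in C follows, assuming each situation is summarized by a 64-bit hash key (e.g. Zobrist hashing). The entry layout, the always-replace policy, and all names are illustrative, not taken from [2].

#include <stdint.h>
#include <stddef.h>

enum Bound { EXACT, LOWER, UPPER };  /* what the stored value means */

typedef struct {
    uint64_t key;                    /* hash of the situation; 0 = empty slot */
    int      depth;                  /* search depth the value was computed at */
    double   value;
    enum Bound bound;
} TTEntry;

#define TT_SIZE (1 << 16)            /* power of two, so masking works below */
static TTEntry table[TT_SIZE];

TTEntry *tt_probe(uint64_t key)
{
    TTEntry *e = &table[key & (TT_SIZE - 1)];
    return (e->key == key) ? e : NULL;   /* a hit only if the full keys match */
}

void tt_store(uint64_t key, int depth, double value, enum Bound bound)
{
    TTEntry *e = &table[key & (TT_SIZE - 1)];  /* always-replace scheme */
    e->key = key; e->depth = depth; e->value = value; e->bound = bound;
}

The bound field matters because, as section 3.3 showed, a value computed under a cutoff may be only an upper or lower bound rather than an exact score.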
The real revolution, however, lies in new search algorithms.
MTD(f) (Memory-enhanced Test Driver) is an excellent new method. Its idea is very simple: since a zero-window α-β call can decide whether a situation's optimal value is greater or less than a fixed value, just use it repeatedly. Suppose the optimal value f of a situation is known to lie in the range min <= f <= max (say min = -1000, max = 1000). Pick a value g in this range and run a zero-window α-β call against it: if the result is greater than g, then g is a lower bound on f and we set min = g; otherwise g is an upper bound and we set max = g. Either way the range shrinks. Repeat until the upper and lower bounds converge to the same value.
Although MTD(f) still uses α-β pruning, it does so only as a subroutine. The idea is delightful: it jumps completely outside the earlier line of thought and arrives from a brand-new direction.
Provided a hash table is used to preserve results across the repeated searches, this algorithm outperforms NegaScout. For more about MTD(f) see [3]; here I simply give its pseudo code:
double MTDF(int depth, double f, Position p)
{
    double g, upper = 1000, lower = -1000, beta;
    g = f;
    while (lower < upper)
    {
        if (g == lower)
            beta = g + 0.1;
        else
            beta = g;
        g = AlphaBetaWithMemory(depth, beta - 0.1, beta, p);
        if (g < beta)
            upper = g;               /* g is a new upper bound */
        else
            lower = g;               /* g is a new lower bound */
    }
    return g;
}
Note that, as the name "memory-enhanced" suggests, the α-β search used here must keep its earlier results in a hash table (hence the function name AlphaBetaWithMemory, to distinguish it from plain α-β pruning without a hash table). MTD(f) re-searches the same situations over and over, so retaining previous results is what makes it efficient. In chess experiments this algorithm improved efficiency by about 15%.
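In practice MTD(f) also needs a good first guess f. A common arrangement, sketched here under the same assumptions as the earlier sketches (opaque Position passed by pointer; names illustrative), is iterative deepening: the result at each depth seeds the call at the next depth.

typedef struct Position Position;    /* opaque board type, assumed */
double MTDF(int depth, double f, Position *p);   /* pointer variant of the above */

double iterative_deepening(int max_depth, Position *p)
{
    double f = 0.0;                  /* initial guess */
    for (int d = 1; d <= max_depth; d++)
        f = MTDF(d, f, p);           /* each result seeds the next, deeper call */
    return f;
}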
There are still other search algorithms, such as SSS* [4], B* [5], and ProbCut [6]. They differ in various ways from the algorithms discussed here (in the evaluation, in the pruning, or in the search order) and will not be covered; interested readers can consult the literature.
4. Learning by Computer Players
The development of machine learning brings new hope for improving computer play. All kinds of learning techniques have been applied to one part of the game or another; [7] surveys the applications of machine learning in computer chess.
Here I present one typical problem: learning the evaluation function.
4.1 Supervised Learning
Recall from Part 2 that a general evaluation (valuation) function has the form:
f(G) = w1·x1 + w2·x2 + … + wi·xi + … + wn·xn = Σi wi·xi
Here xi indicates some factor (or feature) of the situation, and wi, usually called the weight, measures that feature's importance in the evaluation (i.e. its score). When the factor is a piece of material, a fairly accurate value is easy to give in Chinese chess or chess; but when the factor is some positional advantage, evaluating it accurately may stump even experienced masters.
Although it is hard for a master to give a general value for an isolated feature, say the worth of a chariot in Chinese chess, a master can give a fairly accurate overall evaluation of a specific situation. This is what makes learning the evaluation function possible.
A simple method is to collect many situations and ask masters to supply their own judgment of each. We then take the masters' judgments as the standard answers and train the evaluation function against them. Because learning proceeds under the supervision of a "teacher", this is called supervised learning (learning with a teacher).
Concretely, suppose there are K situations G1, G2, …, GK, and the n features of the j-th situation Gj are x1(j), x2(j), …, xn(j). Let the masters' evaluations of the K situations be h1, h2, …, hK. The task is to find weights w1, w2, …, wn that satisfy the following equations as nearly as possible:
w1·x1(1) + w2·x2(1) + … + wn·xn(1) = h1
w1·x1(2) + w2·x2(2) + … + wn·xn(2) = h2
        ...
w1·x1(K) + w2·x2(K) + … + wn·xn(K) = hK
We say "as nearly as possible" because this system of equations in w1, w2, …, wn need not have a solution; what we want is for the actual evaluations f1, f2, …, fK to come as close as possible to the masters' answers h1, h2, …, hK.
When the evaluation function is complex, this is a genuine optimization problem, commonly attacked with gradient descent (interested readers can consult any book on optimization). Because our evaluation function here is very simple, many other methods would also work (least squares, for example). Below we apply gradient descent to this simple concrete problem.
First assign all the weights w1, w2, …, wn random values. Then iterate as follows: at step t, compute the evaluations f1, f2, …, fK of the situations G1, G2, …, GK, and update each weight for step t + 1 by
wi(t + 1) = wi(t) + η · Σ(j = 1..K) (hj − fj) · xi(j)
The weights are updated repeatedly until the evaluations stop improving. The η in the formula is called the learning rate. A larger η makes learning faster (fewer iterations) but less precise (strictly speaking, it oscillates in the final stage); a smaller η increases precision but slows learning down.
A reader with some knowledge of artificial neural networks may recognize the above as the learning procedure of a linear perceptron.
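The whole update fits in a dozen lines of C. This is a minimal sketch of one batch iteration of the rule above; the names mirror the text (x[j][i] is feature i of situation j, h[j] the master's answer, eta the learning rate) and C99 variable-length array parameters are used.

/* One iteration of w_i <- w_i + eta * sum_j (h_j - f_j) * x_i(j). */
void train_step(int K, int n, double w[n],
                const double x[K][n], const double h[K], double eta)
{
    double delta[n];                 /* accumulated gradient per weight */
    for (int i = 0; i < n; i++)
        delta[i] = 0.0;
    for (int j = 0; j < K; j++)
    {
        double f = 0.0;              /* f_j = sum_i w_i * x_i(j) */
        for (int i = 0; i < n; i++)
            f += w[i] * x[j][i];
        for (int i = 0; i < n; i++)
            delta[i] += (h[j] - f) * x[j][i];
    }
    for (int i = 0; i < n; i++)
        w[i] += eta * delta[i];
}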
4.2 Reinforcement Learning
Supervised learning is not a fully satisfactory method (though in some settings it is irreplaceable). After all, being such a "teacher" is exhausting: it is hard to imagine asking a master to hand-score hundreds or even thousands of situations.
Reinforcement learning can be seen as a weaker form of supervised learning: the "teacher" no longer supplies an exact answer every time, but only a judgment of good or bad, or some rough measure of how good, about the computer's results. In game playing, such "teachers" are everywhere: a match between the computer player and any opponent can serve as one, with the win or loss of a game acting as the teacher's verdict on the computer player. The key to learning, of course, is how to use this verdict effectively.
Take Chinese chess or chess as an example. Over the course of a game, the evaluation function's results are sometimes good and sometimes poor. This is because among the weights w1, w2, …, wn some are fairly accurate (e.g. the material weights) while others are not (e.g. the weights of certain positional factors). In actual play it often happens that one side first gains a positional advantage, then converts it into a material advantage, and finally wins; more generally, one side first gains some potential advantage and later turns it into a concrete one. Broadly speaking, computer players often misjudge potential advantages but judge concrete ones (a material edge, or a directly forced mate) correctly. We can therefore draw a rough but essentially sound conclusion: in the later stages of a game the computer player's evaluations tend to be correct. By the same token, a correct evaluation of an earlier situation should be close to the evaluations near the end of the game. If the computer can make effective use of its own late-game evaluations, it can attach fairly accurate judgments to earlier situations and use them as the standard answers h1, h2, …, hK for the supervised training of the previous section.
TD(λ) is a practical reinforcement learning technique. TD stands for temporal difference: the difference between the evaluations at two adjacent moments in a game. If the evaluation function were accurate, this difference should be close to 0. Suppose G1, G2, …, GK, GK+1 are K + 1 consecutive situations from some moment to the end of a game. By the assumption above, the evaluation at the end of the game is accurate; that is, for the (K+1)-th situation the standard answer is hK+1 = f(GK+1). We now use this value to build standard answers for the first K situations, stepping backward from the end of the game: at each moment the standard answer is
hi = (1 − λ) · f(Gi+1) + λ · hi+1
From this formula: when λ = 0, the standard answer for the i-th situation is just the evaluation of the next situation, f(Gi+1); when λ = 1, every standard answer equals hK+1, the evaluation at the end of the game. For λ between 0 and 1, the standard answer is a compromise between the two. After training, the evaluation function will make neighboring situations in a game agree with each other and gradually approach the final result.
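Computed backward, the targets take only a few lines. Here is a minimal C sketch under the text's assumptions: eval[0..K] holds the program's own evaluations f(G1), …, f(GK+1) of the K + 1 situations of one game, and the routine fills h[0..K] with the standard answers (arrays are 0-based, so eval[i] stands for f(Gi+1)).

/* Build the TD(lambda) training targets for one game, stepping backward
   from the end of the game as described in the text. */
void td_targets(int K, const double eval[], double h[], double lambda)
{
    h[K] = eval[K];          /* h_{K+1} = f(G_{K+1}): the trusted final judgment */
    for (int i = K - 1; i >= 0; i--)
        h[i] = (1.0 - lambda) * eval[i + 1] + lambda * h[i + 1];
}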
TD(λ) has been used in several computer game programs. In [8] this technique (slightly modified; interested readers should consult the paper) was applied to a chess program: after a little over 300 games on the Internet, its rating rose from 1650 (average) to 2110 (US master level).
5. Intelligence
When Deep Blue defeated Kasparov for the first time in a single game (in the 1996 match), the then world chess champion (I still call him that, although the title's legitimacy was questioned after Kasparov's split with the international chess federation) said in Time magazine: "I could feel, I could smell, a new kind of intelligence across the table."
His remark makes two points. First, whatever philosophers may think about it, Deep Blue is intelligent, at least in Kasparov's eyes. I will continue to shy away from that philosophical discussion and focus on his second point: it is a new kind of intelligence, unlike human intelligence.
Kasparov may not know the details of how computer players "think", yet he could still feel the difference between a computer player and a human. Recall the claim mentioned in Part 1, that a computer player is left helpless by moves that depart from standard theory. Having understood from Parts 2 and 3 how computers play, we can now judge it. In the opening, a computer does follow its opening book; but once an unconventional move is played, that is, once the game leaves the book, the computer player simply enters its search procedure. From then on, any unsound "unconventional" move will undoubtedly be punished most severely, for a computer player knows nothing of chess theory: it only knows what is good for it and what is bad for it. The claim actually describes not a computer player, but a human player with high theoretical knowledge and insufficient practical experience: a player steeped in the books who, facing moves clearly inferior to the theoretically best ones, lacks the strength to convert the advantage may be overwhelmed by the complications and eventually lose.
In fact, notions such as pattern, shape, and plan belong entirely to human intelligence (which is not to say they equal human intelligence), and today's computer players possess nothing of the kind. Deep Blue may be stronger than most humans, but it relies only on raw computing power. Humans do calculate several moves ahead when they play, but humans know how to select rather than simply enumerate the changes: at each step a human considers not every legal move but only the most promising candidates (a selection that is not always correct, and may well overlook the best move). Strong players also know that a game must be guided by a plan that runs through the whole struggle, whereas Deep Blue substitutes sheer search depth for any such plan.
6. New Challenges
The contest between Deep Blue and Kasparov was not entirely fair. Deep Blue's designers had access to a large number of Kasparov's games, while Kasparov knew almost nothing about how Deep Blue played. If Kasparov had been able to study Deep Blue's practice games, might he have spotted the weaknesses inherent in its evaluation function?
If computers' outstanding performance in chess masks these shortcomings, Go exposes them: it has become the new challenge.
For Go, at least in current research, global deep search cannot be used to compensate for an inaccurate evaluation function. The reason is that each move in Go offers 200 or 300 choices, and in many cases (life-and-death problems, for instance) one absolutely must search dozens of moves deep. The computation required is staggering.
Evaluation, however, is just as difficult. In the evaluation function for chess, the fairly accurate material term overwhelms the less accurate positional term, so the overall error stays small; even an evaluation that counts only material and ignores position can, backed by deep search, produce passable results. In Go, by contrast, it seems almost impossible to find such a concise evaluation function. Typically the evaluation must first find the strings (connected blocks of stones) in the position, then settle the life and death of each string by a local search (here the value is no longer a score but one of three outcomes: dead, alive, or unsettled, which narrows the search), and on that basis apply mathematical formulas to decide the ownership of every point on the board.
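The first of those steps, finding the strings, is easy to sketch. Below is a minimal C flood fill that marks the string containing a given stone and counts its liberties; the board encoding (0 empty, 1 black, 2 white) is illustrative, and for simplicity a liberty adjacent to two stones of the string is counted twice.

#define SIZE 19

static int seen[SIZE][SIZE];     /* cleared before examining each new string */

/* Marks the string of 'color' stones containing (r, c) in seen[] and
   returns its liberty count (with the multiplicity caveat above). */
int flood(const int board[SIZE][SIZE], int r, int c, int color)
{
    if (r < 0 || r >= SIZE || c < 0 || c >= SIZE)
        return 0;                               /* off the board */
    if (board[r][c] == 0)
        return 1;                               /* an adjacent empty point */
    if (board[r][c] != color || seen[r][c])
        return 0;                               /* other color, or already visited */
    seen[r][c] = 1;
    return flood(board, r - 1, c, color) + flood(board, r + 1, c, color)
         + flood(board, r, c - 1, color) + flood(board, r, c + 1, color);
}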
Even if we do not question the correctness of those formulas, a simple problem exposes the difficulty. (The two diagrams the author refers to here are not reproduced.) Each of the two white groups shown is locally not dead: in the left diagram White lives if he gets one move there, and in the right diagram, after Black 1, White again lives with one move. I believe today's computer Go programs would readily reach these local judgments. But when the two shapes appear on the board in the same position, White cannot answer in both places at once, and no matter how White plays, one of the two groups must die. That is, neither group is locally a dead shape, yet the global picture is entirely different. A program's evaluation, however, treats the left group as living with White's first move and the right group as alive, producing a gross error. This is only one typical counterexample; practice supplies many similar problems. Can some integrating technique solve them? So far, at least, I see no sign of that possibility.
It is almost certain that every computer Go program has fatal weaknesses of this kind. Once such a weakness is identified, even an amateur expert can beat the computer at a handicap of more than ten stones.
Once search loses its dominant position, the performance of computer players stops being satisfactory, and this is not a problem that faster computers alone can solve. When will this pattern change? What road leads to the future of computer players? Everything awaits the proof of practice.
References
[1] A. Reinefeld. An Improvement of the Scout Tree Search Algorithm. ICCA Journal, Vol. 6, No. 4, 1983.
[2] J. Schaeffer. Distributed Game-Tree Searching. Journal of Parallel and Distributed Computing, Vol. 6, 1989.
[3] A. Plaat, J. Schaeffer, W. Pijls, A. de Bruin. Best-First Fixed-Depth Minimax Algorithms. Artificial Intelligence, 1996.
[4] G. C. Stockman. A Minimax Algorithm Better than Alpha-Beta? Artificial Intelligence, Vol. 12, No. 2, 1979.
[5] H. J. Berliner, C. McConnell. B* Probability Based Search. Artificial Intelligence, Vol. 86, No. 1, 1996.
[6] M. Buro. ProbCut: An Effective Selective Extension of the Alpha-Beta Algorithm. ICCA Journal, Vol. 18, No. 2, 1995.
[7] J. Fürnkranz. Machine Learning in Computer Chess: The Next Generation. ICCA Journal, Vol. 19, No. 3, 1996.
[8] J. Baxter, A. Tridgell, L. Weaver. TDLeaf(λ): Combining Temporal Difference Learning with Game-Tree Search. Australian Journal of Intelligent Information Processing, 1998.