actually incorporates all the information about how good our action $a$ is. For example, they decayed the learning rate as training progressed and they also used regularization to prevent overfitting. Finally, their algorithm generalizes to any deterministic board game.
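The post doesn't spell out how the decay and the regularization are wired up, so here is a minimal sketch of one way to do it in PyTorch; the optimizer choice, the milestone steps and the weight-decay coefficient are my own assumptions, not values from the source.

```python
import torch

# Placeholder model for illustration only; any nn.Module would do here.
net = torch.nn.Linear(361, 362)

# L2 regularization expressed through weight_decay (coefficient assumed).
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

# Step-wise learning-rate decay as training progresses (milestones assumed).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100_000, 300_000], gamma=0.1
)

# Inside the training loop one would then call:
#   loss.backward(); optimizer.step(); scheduler.step()
```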
The problem is that, because the convolutions are applied successively, a small change in the first filter can introduce a huge change at the end of the chain.
Figure I.6: The components of a Convolutional Layer of the AlphaGo Zero Neural Network
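As a rough sketch of the components listed in the figure above (not the exact architecture from the paper or the minigo code), such a convolutional layer can be written as a convolution followed by batch normalization and a ReLU; the channel counts and kernel size below are assumptions for illustration.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolutional layer: convolution -> batch norm -> ReLU.

    Batch normalization keeps the activations in a stable range, so a small
    change in an early filter does not blow up along the chain of convolutions.
    """

    def __init__(self, in_channels=17, out_channels=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))
```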
However, we would like our Neural Network to also output another piece of information! Hence, if $T=1$ we will select the action $a_2$ with probability $0.6$, while, if $T=0.1$, we will select it with probability $0.982 \approx 1$. On the other hand, since we are selecting the action $a_i$, $N(s, a_i)$ will be incremented by $1$. We probably want our Neural Network to also tell us how likely we are to win from the current position. The AlphaGo Zero AI relies on $2$ main components.
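To make the temperature effect concrete, here is a small sketch of turning visit counts $N(s, a)$ into selection probabilities proportional to $N(s, a)^{1/T}$; the counts used below are an assumption chosen so that $a_2$ gets probability $0.6$ at $T=1$ (they are not taken from the source), and at $T=0.1$ its probability rises to roughly $0.98$.

```python
import numpy as np

def action_probabilities(visit_counts, temperature):
    """pi(a) proportional to N(s, a)^(1/T)."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    exponentiated = counts ** (1.0 / temperature)
    return exponentiated / exponentiated.sum()

# Assumed visit counts: N(s, a_1) = 4, N(s, a_2) = 6.
counts = [4, 6]

print(action_probabilities(counts, temperature=1.0))  # [0.4, 0.6]
print(action_probabilities(counts, temperature=0.1))  # ~[0.017, 0.983], close to 1

# Sampling an action according to these probabilities:
rng = np.random.default_rng(0)
action = rng.choice(len(counts), p=action_probabilities(counts, temperature=1.0))
```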
I Deep Learning
But this is not how we should let the $2$ Neural Networks play against each other! They've also used other common tricks. Hence, AlphaGo Zero has been able to achieve in a matter of days the knowledge that took Go masters and scholars thousands of years of collective intelligence to develop. The Neural Network will also output a float in the range $(-1, 1)$ telling us how likely it thinks we will win or lose the game. The first component is a Neural Network, while the second component is the Monte Carlo Tree Search (MCTS).
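To make the two outputs concrete, here is a minimal sketch (not the exact architecture from the paper or the minigo code) of a network with a policy head and a value head, the latter squashed into $(-1, 1)$ with a tanh; the board size, input planes, channel counts and layer sizes are assumptions for illustration.

```python
import torch.nn as nn

BOARD_SIZE = 9                      # assumed 9x9 board, as in Figure I.2
NUM_ACTIONS = BOARD_SIZE ** 2 + 1   # one move per intersection plus "pass"

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Policy head: one logit per possible action.
        self.policy_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * BOARD_SIZE * BOARD_SIZE, NUM_ACTIONS),
        )
        # Value head: a single float in (-1, 1) estimating win/loss.
        self.value_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * BOARD_SIZE * BOARD_SIZE, 1),
            nn.Tanh(),
        )

    def forward(self, x):
        features = self.trunk(x)
        policy_logits = self.policy_head(features)  # softmax applied later
        value = self.value_head(features)           # in (-1, 1)
        return policy_logits, value
```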
This implementation is largely inspired by the unofficial minigo implementation. Well, the best you can do is to add all the numbers, and you'll get $5 + 8 + 12 + 25 + 3 = 53$, which is very far from … Here again, as it is a Neural Network, we will need to train it on lots of data… like millions and millions of games… One idea would be to use a database that contains the very best games from the best Go players in the world. That's it. It thus makes the Neural Network more robust. Another trick used by DeepMind is to parallelize the training of the Neural Network. I see that you're asking yourself the right questions! According to the previous argument we can just solve: … Hence, once we have selected the bad actions $19$ times each, we are … Hence, we will associate a higher probability to the actions that have been selected the most during the MCTS simulations. To train our Neural Network we will use the data generated during the self-play games.
Figure I.2: Basic input and output for our Neural Network for a $9 \times 9$ board game.
AlphaGo Zero's hyperparameters were chosen with Bayesian optimization, whereas AlphaZero reuses AlphaGo Zero's hyperparameters, applying the Go hyperparameters to chess as well.
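For the training step itself, the AlphaGo Zero paper combines a mean-squared error on the value output with a cross-entropy between the predicted policy and the MCTS visit-count distribution, plus L2 regularization. Here is a minimal sketch of that loss, assuming the hypothetical `PolicyValueNet` outputs from the earlier sketch; the L2 term is left to the optimizer's `weight_decay` as sketched above.

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(policy_logits, value, target_pi, target_z):
    """Combined loss on one batch of self-play data.

    target_pi: MCTS visit-count distributions (the improved policy).
    target_z:  final game outcomes (+1 / -1) from the current player's view.
    """
    value_loss = F.mse_loss(value.squeeze(-1), target_z)
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss
```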
While doing my research I stumbled upon the AlphaGo Zero course by the Depth First Learning group. I won't go into the details of why it is named like this.
Let's say we want to simulate $1600$ MCTS expansions. That is to say that the value $P(s, a)$ returned by the neural network will be stored in the child nodes of node $13$ (colored in black in the figure above). I didn't represent it in the picture above, but, since we selected the action $13$, we have $N(s, a_{13}) = 1$, because, in the very beginning, $\forall i$, $N(s, a_i) = W(s, a_i) = 0$. Let's now see what happens during the $j^{th}$ iteration of the algorithm.
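The bookkeeping just described, storing the prior $P(s, a)$ in the child nodes and keeping the counters $N(s, a)$ and $W(s, a)$ initialized to $0$, can be sketched with a small node class; the class and field names below are my own, not taken from the minigo code.

```python
class MCTSNode:
    """One node of the search tree for a given board state."""

    def __init__(self, prior):
        self.prior = prior      # P(s, a): probability assigned by the neural network
        self.visit_count = 0    # N(s, a): starts at 0
        self.total_value = 0.0  # W(s, a): starts at 0
        self.children = {}      # action -> MCTSNode

    def expand(self, action_priors):
        """Store the network's P(s, a) in the child nodes (one per legal action)."""
        for action, prior in action_priors.items():
            self.children[action] = MCTSNode(prior)

    def backup(self, value):
        """Update the counters after a simulation has passed through this node."""
        self.visit_count += 1
        self.total_value += value

    def mean_value(self):
        """Q(s, a) = W(s, a) / N(s, a)."""
        return self.total_value / self.visit_count if self.visit_count else 0.0
```

Running the $1600$ expansions is then a loop that selects a leaf, expands it with the network's priors, and backs the returned value up along the visited path.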