

6.3 Future Work

Several changes could be made to further speed up the system. One possibility is to implement threaded MCTS in e.g. C++, which would allow parallel searches within each self-play game and batched network evaluation without the overhead of separate processes. Distributing the system across more GPUs, for example by adapting it to run on NTNU's Idun cluster, would also help increase the ratio of self-play games to training iterations. With a more computationally performant system, it would be easier to test various hyperparameters and tune the system properly. It would also allow for larger networks, bigger boards, more simulations and longer training.
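To make the batching idea concrete, the following Python sketch outlines one possible pattern: several searcher threads submit leaf states to a shared queue, and a single evaluator thread collects them into one batched forward pass. It is only an illustrative sketch, not part of the existing system; the thesis suggests C++ threads precisely to avoid Python's interpreter overhead, and the names net, BATCH_SIZE and TIMEOUT, as well as the assumption that net is a two-headed policy/value network accepting batched input, are placeholders.

# Sketch only: searcher threads request evaluations, one evaluator thread batches them.
# `net` is assumed to be a policy/value network taking a batch of encoded board states.
import queue
import threading
from concurrent.futures import Future

import torch

BATCH_SIZE = 16   # assumed batch size
TIMEOUT = 0.005   # seconds to wait before evaluating a partial batch

request_queue: "queue.Queue[tuple[torch.Tensor, Future]]" = queue.Queue()

def evaluate_async(state_tensor: torch.Tensor) -> Future:
    # Called by a searcher thread at a leaf node; the Future resolves to (policy, value).
    fut: Future = Future()
    request_queue.put((state_tensor, fut))
    return fut

def evaluator_loop(net: torch.nn.Module) -> None:
    # Collect up to BATCH_SIZE pending requests, run one forward pass, deliver results.
    while True:
        states, futures = [], []
        s, f = request_queue.get()  # block until at least one request arrives
        states.append(s)
        futures.append(f)
        try:
            while len(states) < BATCH_SIZE:
                s, f = request_queue.get(timeout=TIMEOUT)
                states.append(s)
                futures.append(f)
        except queue.Empty:
            pass
        with torch.no_grad():
            policies, values = net(torch.stack(states))
        for fut, p, v in zip(futures, policies, values):
            fut.set_result((p, v))

# A searcher thread would call evaluate_async(state).result() at each leaf, while
# threading.Thread(target=evaluator_loop, args=(net,), daemon=True).start() runs the evaluator.

In C++, the same pattern could be realised with std::thread and a condition variable guarding the request queue, removing the interpreter overhead entirely.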

To properly assess how well the system learns, the resulting agents could be compared to those of prior work in the field, e.g. MoHex. This could involve straight comparisons of win ratios, but also a deeper analysis of playstyles to identify strengths and weaknesses.

To get more conclusive data on the first research question, the number of simulations could be increased and various weightings could be tested in experiment 2 to see whether any of them made random rollouts beneficial. For the second research question, experiment 3 could be repeated with a better-trained network and fine-tuned parameters such as the exploration constant. An agent without rollouts could also be included in the comparison. Improving these experiments would be far more feasible if the speed of the system were improved beforehand.
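For reference, the sketch below shows the two quantities such a tuning would vary, assuming the leaf-evaluation weighting of Silver et al. (2016), in which a mixing weight blends the value-head estimate with the rollout outcome, and a PUCT-style selection rule in which the exploration constant appears; the exact formulation used in the experiments may differ.

import math

def leaf_value(network_value: float, rollout_result: float, lam: float) -> float:
    # Weighted leaf evaluation as in Silver et al. (2016): lam = 0 ignores rollouts,
    # lam = 1 uses only the rollout result. The weightings tested in experiment 2
    # would correspond to different values of lam.
    return (1.0 - lam) * network_value + lam * rollout_result

def puct_score(q: float, prior: float, parent_visits: int, visits: int, c_puct: float) -> float:
    # PUCT selection score; c_puct is the exploration constant to be tuned.
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)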

If an alternative, novel rollout policy that both preserves generality and improves performance could be formulated, it would be a significant contribution to the field. This might require, for example, a thorough study of prior art to see whether there is related work on general policies that could be applicable.

Bibliography

Allis, L. (1994). Searching for solutions in games and artificial intelligence. PhD thesis, Maastricht University.

Anthony, T., Tian, Z., and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370.

Arneson, B., Hayward, R. B., and Henderson, P. (2010). Monte carlo tree search in hex. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):251–258.

Bouzy, B. (2004). Associating shallow and selective global tree search with monte carlo for 9×9 go. In International Conference on Computers and Games, pages 67–80. Springer.

Browne, C. (2000). Hex Strategy: Making the Right Connections. Ak Peters Series. Taylor & Francis.

Campbell, M., Hoane Jr, A. J., and Hsu, F.-h. (2002). Deep blue. Artificial intelligence, 134(1-2):57–83.

Clark, C. and Storkey, A. (2015). Training deep convolutional neural networks to play go. In International conference on machine learning, pages 1766–1774.

Coulom, R. (2006). Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer.

Enzenberger, M., Muller, M., Arneson, B., and Segal, R. (2010). Fuego—an open-source framework for board games and go engine based on monte carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270.


Gale, D. (1979). The game of hex and the brouwer fixed-point theorem. The American Mathematical Monthly, 86(10):818–827.

Gelly, S., Kocsis, L., Schoenauer, M., Sebag, M., Silver, D., Szepesvári, C., and Teytaud, O. (2012). The grand challenge of computer go: Monte carlo tree search and extensions. Communications of the ACM, 55(3):106–113.

Gelly, S. and Silver, D. (2007). Combining online and offline knowledge in uct. In Proceedings of the 24th international conference on Machine learning, pages 273–280. ACM.

Gelly, S. and Silver, D. (2008). Achieving master level play in 9×9 computer go. In AAAI, volume 8, pages 1537–1540.

Gelly, S. and Silver, D. (2011). Monte-carlo tree search and rapid action value estimation in computer go. Artificial Intelligence, 175(11):1856–1875.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

Hingston, P. and Masek, M. (2007). Experiments with monte carlo othello. In 2007 IEEE Congress on Evolutionary Computation, pages 4059–4064. IEEE.

Kocsis, L. and Szepesvári, C. (2006). Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

Landau, T. (1985). Othello: Brief & basic. US Othello Association, 920:22980–23425.

Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.

Lundgaard, N. and McKee, B. (2006). Reinforcement learning and neural networks for tetris. Technical report, University of Oklahoma.

Maddison, C. J., Huang, A., Sutskever, I., and Silver, D. (2014). Move evaluation in go using deep convolutional neural networks. arXiv preprint arXiv:1412.6564.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Nash, J. (1952). Some games and machines for playing them. Technical Report D-1164, Rand Corp.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Python contributors (2019a). Pipes and queues. https://docs.python.org/3.8/library/multiprocessing.html#pipes-and-queues. [Online; accessed 13-January-2020].

Python contributors (2019b). Thread state and the global interpreter lock. https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock. [Online; accessed 13-January-2020].

PyTorch contributors (2019). Multiprocessing best practices. https://pytorch.org/docs/stable/notes/multiprocessing.html. [Online; accessed 22-January-2020].

Rosin, C. D. (2011). Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230.

Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635.

Russell, S. J. and Norvig, P. (2010). Artificial intelligence: a modern approach. Pearson Education Limited.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354.

Wang, Y. and Gelly, S. (2007). Modifications of uct and sequence-like simulations for monte-carlo go. In 2007 IEEE Symposium on Computational Intelligence and Games, pages 175–182. IEEE.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.

Appendix A