-
source
arxiv
-
source_type
latex
-
converted_with
pandoc
-
paper_version
1805.07470v1
-
title
Solving the Rubik's Cube Without Human Knowledge
-
authors
["Stephen McAleer","Forest Agostinelli","Alexander Shmakov","Pierre Baldi"]
-
date_published
2018-05-18 23:07:31+00:00
-
data_last_modified
2018-05-18 23:07:31+00:00
-
abstract
A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In these environments, a reward is always received at the end of the game, however, for many combinatorial optimization environments, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration: a novel reinforcement learning algorithm that is able to teach itself how to solve the Rubik's Cube with no human assistance. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves -- less than or equal to solvers that employ human domain knowledge.
-
author_comment
First three authors contributed equally. Submitted to NIPS 2018
-
journal_ref
null
-
doi
null
-
primary_category
cs.AI
-
categories
["cs.AI"]
-
citation_level
0
-
alignment_text
unlabeled
-
confidence_score
0.8631314971
-
main_tex_filename
./main.tex
-
bibliography_bbl
\begin{thebibliography}{10}
\bibitem{ruwix}
How to solve the rubik's cube - beginners method.
\newblock
\url{https://ruwix.com/the-rubiks-cube/how-to-solve-the-rubiks-cube-beginners-method/}.
\bibitem{anthony_2017}
Thomas Anthony, Zheng Tian, and David Barber.
\newblock Thinking fast and slow with deep learning and tree search, 2017.
\bibitem{bellman1957dynamic}
Richard Bellman.
\newblock {\em Dynamic Programming}.
\newblock Princeton University Press, 1957.
\bibitem{bello}
Irwan Bello, Hieu Pham, Quoc~V. Le, Mohammad Norouzi, and Samy Bengio.
\newblock Neural combinatorial optimization with reinforcement learning, 2016.
\bibitem{bottou_2011}
Leon Bottou.
\newblock From machine learning to machine reasoning, 2011.
\bibitem{brown_2017}
Andrew Brown.
\newblock Rubik's cube solver.
\newblock \url{https://github.com/brownan/Rubiks-Cube-Solver}, 2017.
\bibitem{Brunetto_2017}
Robert Brunetto and Otakar Trunda.
\newblock Deep heuristic-learning in the rubik’s cube domain: an experimental
evaluation.
\newblock 2017.
\bibitem{coulom2006efficient}
R{\'e}mi Coulom.
\newblock Efficient selectivity and backup operators in monte-carlo tree
search.
\newblock In {\em International conference on computers and games}, pages
72--83. Springer, 2006.
\bibitem{herik_2007}
R{\'e}mi Coulom.
\newblock Efficient selectivity and backup operators in monte-carlo tree
search.
\newblock In H.~Jaap van~den Herik, Paolo Ciancarini, and H.~H. L. M.~(Jeroen)
Donkers, editors, {\em Computers and Games}, pages 72--83, Berlin,
Heidelberg, 2007. Springer Berlin Heidelberg.
\bibitem{goodfellow2016deep}
Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
\newblock {\em Deep learning}, volume~1.
\newblock MIT press Cambridge, 2016.
\bibitem{guo_2014}
Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard~L Lewis, and Xiaoshi Wang.
\newblock Deep learning for real-time atari game play using offline monte-carlo
tree search planning.
\newblock In Z.~Ghahramani, M.~Welling, C.~Cortes, N.~D. Lawrence, and K.~Q.
Weinberger, editors, {\em Advances in Neural Information Processing Systems
27}, pages 3338--3346. Curran Associates, Inc., 2014.
\bibitem{hinton_2014}
Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky.
\newblock Overview of mini-batch gradient descent, 2014.
\bibitem{kociemba_2018}
Herbert Kociemba.
\newblock Two-phase algorithm details.
\newblock \url{http://kociemba.org/math/imptwophase.htm}.
\bibitem{kocsis2006bandit}
Levente Kocsis and Csaba Szepesv{\'a}ri.
\newblock Bandit based monte-carlo planning.
\newblock In {\em European conference on machine learning}, pages 282--293.
Springer, 2006.
\bibitem{korf_1997}
Richard~E. Korf.
\newblock Finding optimal solutions to rubik's cube using pattern databases.
\newblock In {\em Proceedings of the Fourteenth National Conference on
Artificial Intelligence and Ninth Conference on Innovative Applications of
Artificial Intelligence}, AAAI'97/IAAI'97, pages 700--705. AAAI Press, 1997.
\bibitem{kunkle_2007}
Daniel Kunkle and Gene Cooperman.
\newblock Twenty-six moves suffice for rubik's cube.
\newblock In {\em Proceedings of the 2007 International Symposium on Symbolic
and Algebraic Computation}, ISSAC '07, pages 235--242, New York, NY, USA,
2007. ACM.
\bibitem{lecun2015deep}
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.
\newblock Deep learning.
\newblock {\em nature}, 521(7553):436, 2015.
\bibitem{lichodzijewski2011rubik}
Peter Lichodzijewski and Malcolm Heywood.
\newblock The rubik cube and gp temporal sequence learning: an initial study.
\newblock In {\em Genetic Programming Theory and Practice VIII}, pages 35--54.
Springer, 2011.
\bibitem{mnih2016asynchronous}
Volodymyr Mnih, Adria~Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy~P
Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu.
\newblock Asynchronous methods for deep reinforcement learning.
\newblock In {\em International Conference on Machine Learning (ICML)}, 2016.
\bibitem{mnih2015human}
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei~A Rusu, Joel Veness,
Marc~G Bellemare, Alex Graves, Martin Riedmiller, Andreas~K Fidjeland, Georg
Ostrovski, et~al.
\newblock Human-level control through deep reinforcement learning.
\newblock {\em Nature}, 518(7540):529--533, 2015.
\bibitem{moriarty1999evolutionary}
David~E Moriarty, Alan~C Schultz, and John~J Grefenstette.
\newblock Evolutionary algorithms for reinforcement learning.
\newblock {\em Journal of Artificial Intelligence Research}, 11:241--276, 1999.
\bibitem{radu_2007}
Silviu Radu.
\newblock Rubik's cube can be solved in 34 quarter turns.
\newblock \url{http://cubezzz.dyndns.org/drupal/?q=node/view/92}, Jul 2007.
\bibitem{reid_1995}
Michael Reid.
\newblock Superflip requires 20 face turns.
\newblock
\url{http://www.math.rwth-aachen.de/~Martin.Schoenert/Cube-Lovers/michael_reid__superflip_requires_20_face_turns.html},
Jan 1995.
\bibitem{rokicki2010twenty}
Tomas Rokicki.
\newblock Twenty-two moves suffice for rubik’s cube{\textregistered}.
\newblock {\em The Mathematical Intelligencer}, 32(1):33--40, 2010.
\bibitem{rokicki_2014}
Tomas Rokicki.
\newblock God's number is 26 in the quarter-turn metric.
\newblock \url{http://www.cube20.org/qtm/}, Aug 2014.
\bibitem{rokicki2014diameter}
Tomas Rokicki, Herbert Kociemba, Morley Davidson, and John Dethridge.
\newblock The diameter of the rubik's cube group is twenty.
\newblock {\em SIAM Review}, 56(4):645--670, 2014.
\bibitem{segal_2011}
Richard~B. Segal.
\newblock On the scalability of parallel uct.
\newblock In H.~Jaap van~den Herik, Hiroyuki Iida, and Aske Plaat, editors,
{\em Computers and Games}, pages 36--47, Berlin, Heidelberg, 2011. Springer
Berlin Heidelberg.
\bibitem{silver_2016}
David Silver, Aja Huang, Christopher~J. Maddison, Arthur Guez, Laurent Sifre,
George van~den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda
Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal
Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray
Kavukcuoglu, Thore Graepel, and Demis Hassabis.
\newblock Mastering the game of go with deep neural networks and tree search.
\newblock {\em Nature}, 529:484--503, 2016.
\bibitem{silver2017mastering2}
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew
Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore
Graepel, et~al.
\newblock Mastering chess and shogi by self-play with a general reinforcement
learning algorithm.
\newblock {\em arXiv preprint arXiv:1712.01815}, 2017.
\bibitem{silver_2017}
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja
Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton,
Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van~den
Driessche, Thore Graepel, and Demis Hassabis.
\newblock Mastering the game of go without human knowledge.
\newblock {\em Nature}, 550(7676):354--359, 10 2017.
\bibitem{silver2017mastering1}
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja
Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton,
et~al.
\newblock Mastering the game of go without human knowledge.
\newblock {\em Nature}, 550(7676):354, 2017.
\bibitem{smith2016discovering}
Robert~J Smith, Stephen Kelly, and Malcolm~I Heywood.
\newblock Discovering rubik's cube subgroups using coevolutionary gp: A five
twist experiment.
\newblock In {\em Proceedings of the Genetic and Evolutionary Computation
Conference 2016}, pages 789--796. ACM, 2016.
\bibitem{sutton_1988}
Richard~S. Sutton.
\newblock Learning to predict by the methods of temporal differences.
\newblock {\em Machine Learning}, 3(1):9--44, Aug 1988.
\bibitem{sutton1998reinforcement}
Richard~S Sutton and Andrew~G Barto.
\newblock {\em Reinforcement learning: An introduction}, volume~1.
\newblock MIT press Cambridge, 1998.
\bibitem{tesauro1995temporal}
Gerald Tesauro.
\newblock Temporal difference learning and td-gammon.
\newblock {\em Communications of the ACM}, 38(3):58--68, 1995.
\bibitem{thistlethwaite_1981}
Morwen Thistlethwaite.
\newblock Thistlethwaite's 52-move algorithm.
\newblock \url{https://www.jaapsch.net/puzzles/thistle.htm}, Jul 1981.
\bibitem{tsoy_2018}
Maxim Tsoy.
\newblock Kociemba.
\newblock \url{https://github.com/muodov/kociemba}, 2018.
\bibitem{watkins1992q}
Christopher~JCH Watkins and Peter Dayan.
\newblock Q-learning.
\newblock {\em Machine learning}, 8(3-4):279--292, 1992.
\bibitem{williams1992simple}
Ronald~J Williams.
\newblock Simple statistical gradient-following algorithms for connectionist
reinforcement learning.
\newblock In {\em Reinforcement Learning}, pages 5--32. Springer, 1992.
\end{thebibliography}
-
bibliography_bib
-
arxiv_citations
{"1712.01815":true}
-
alignment_newsletter
{"source":"alignment-newsletter","source_type":"google-sheets","converted_with":"python","venue":"arXiv","newsletter_category":"Reinforcement learning","highlight":true,"newsletter_number":"AN #8","newsletter_url":"https://mailchi.mp/15397ab6f65c/alignment-newsletter-8","summarizer":"Rohin","summary":"This paper proposes _Autodidactic Iteration_ (ADI), which is a technique that can be combined with the techniques in AlphaGo and expert iteration to solve problems with only one goal state, such as the Rubik's cube. MCTS with value and policy networks will not suffice, because when starting from a randomly scrambled cube, MCTS will never find a path to the goal state, and so there will never be any reward signal. (Whereas with Go, even if you play randomly the game will end relatively quickly, giving you some reward signal.) To get around this, they start _from the goal state_ and generate states that are near the goal state. This gives them a training dataset of states for which they know (a good approximation to) the value and the best action, which they can use to train a value and policy network. They then use this with MCTS to solve the full problem, as in AlphaGo.","opinion":"This general idea has been proposed in robotics as well, in [Reverse Curriculum Generation for Reinforcement Learning](https://arxiv.org/abs/1707.05300), where there is a single goal state. However, in this setting we have the added benefit of perfect inverse dynamics, that is, for any action a that moves us from state s to s', we can find the inverse action a' that moves us from state s' to s. This allows the authors to start from the goal state, generate nearby states, and automatically know the value of those states (or at least a very good approximation to it). [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495) also tackles similar issues -- I'd be interested to see if it could solve the Rubik's cube. Overall, the problem of sparse rewards is very difficult, and it seems like we now have another solution in the case where we have a single goal state and perfect (or perhaps just sufficiently good?) inverse dynamics.","prerequisites":"nan","read_more":"nan","paper_version":"1805.07470v1","arxiv_id":"1805.07470","title":"Solving the Rubik's Cube Without Human Knowledge","authors":["Stephen McAleer","Forest Agostinelli","Alexander Shmakov","Pierre Baldi"],"date_published":"2018-05-18 23:07:31+00:00","data_last_modified":"2018-05-18 23:07:31+00:00","url":"http://arxiv.org/abs/1805.07470v1","abstract":"A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In these environments, a reward is always received at the end of the game, however, for many combinatorial optimization environments, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration: a novel reinforcement learning algorithm that is able to teach itself how to solve the Rubik's Cube with no human assistance. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves -- less than or equal to solvers that employ human domain knowledge.","author_comment":"First three authors contributed equally. 
Submitted to NIPS 2018","journal_ref":"None","doi":"None","primary_category":"cs.AI","categories":"['cs.AI']","individual_summary":"Title: Solving the Rubik's Cube Without Human Knowledge\nAuthors: Stephen McAleer, Forest Agostinelli, Alexander Shmakov, Pierre Baldi\nPaper abstract: A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In these environments, a reward is always received at the end of the game, however, for many combinatorial optimization environments, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration: a novel reinforcement learning algorithm that is able to teach itself how to solve the Rubik's Cube with no human assistance. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves -- less than or equal to solvers that employ human domain knowledge.\nSummary: This paper proposes _Autodidactic Iteration_ (ADI), which is a technique that can be combined with the techniques in AlphaGo and expert iteration to solve problems with only one goal state, such as the Rubik's cube. MCTS with value and policy networks will not suffice, because when starting from a randomly scrambled cube, MCTS will never find a path to the goal state, and so there will never be any reward signal. (Whereas with Go, even if you play randomly the game will end relatively quickly, giving you some reward signal.) To get around this, they start _from the goal state_ and generate states that are near the goal state. This gives them a training dataset of states for which they know (a good approximation to) the value and the best action, which they can use to train a value and policy network. They then use this with MCTS to solve the full problem, as in AlphaGo.\nMy opinion: This general idea has been proposed in robotics as well, in [Reverse Curriculum Generation for Reinforcement Learning](https://arxiv.org/abs/1707.05300), where there is a single goal state. However, in this setting we have the added benefit of perfect inverse dynamics, that is, for any action a that moves us from state s to s', we can find the inverse action a' that moves us from state s' to s. This allows the authors to start from the goal state, generate nearby states, and automatically know the value of those states (or at least a very good approximation to it). [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495) also tackles similar issues -- I'd be interested to see if it could solve the Rubik's cube. Overall, the problem of sparse rewards is very difficult, and it seems like we now have another solution in the case where we have a single goal state and perfect (or perhaps just sufficiently good?) inverse dynamics.","paper_text":"","text":"Highlights\n**[Solving the Rubik's Cube Without Human Knowledge](https://arxiv.org/abs/1805.07470)** *(Stephen McAleer, Forest Agostinelli, Alexander Shmakov et al)*: This paper proposes *Autodidactic Iteration* (ADI), which is a technique that can be combined with the techniques in AlphaGo and expert iteration to solve problems with only one goal state, such as the Rubik's cube. 
MCTS with value and policy networks will not suffice, because when starting from a randomly scrambled cube, MCTS will never find a path to the goal state, and so there will never be any reward signal. (Whereas with Go, even if you play randomly the game will end relatively quickly, giving you some reward signal.) To get around this, they start *from the goal state* and generate states that are near the goal state. This gives them a training dataset of states for which they know (a good approximation to) the value and the best action, which they can use to train a value and policy network. They then use this with MCTS to solve the full problem, as in AlphaGo.\n**My opinion:** This general idea has been proposed in robotics as well, in [Reverse Curriculum Generation for Reinforcement Learning](https://arxiv.org/abs/1707.05300), where there is a single goal state. However, in this setting we have the added benefit of perfect inverse dynamics, that is, for any action a that moves us from state s to s', we can find the inverse action a' that moves us from state s' to s. This allows the authors to start from the goal state, generate nearby states, and automatically know the value of those states (or at least a very good approximation to it). [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495) also tackles similar issues -- I'd be interested to see if it could solve the Rubik's cube. Overall, the problem of sparse rewards is very difficult, and it seems like we now have another solution in the case where we have a single goal state and perfect (or perhaps just sufficiently good?) inverse dynamics.\n**[Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior](https://arxiv.org/abs/1805.08010)** *(Siddharth Reddy et al)*: Inverse reinforcement learning algorithms typically assume that the demonstrations come from an expert who is approximately optimal. However, this is often not the case, at least when the experts are fallible humans. This paper considers the case where the expert has an incorrect model of the dynamics (transition function) of the environment, and proposes learning the expert's model of the dynamics to improve reward function inference. However, this leads to severe unidentifiability problems, where many models of the dynamics are compatible with the observed behavior. To overcome this, they assume that they have multiple tasks with known reward functions, which they use to infer the expert's dynamics. This is then used to infer the reward function in a new task using an adaptation of max causal entropy IRL. The dynamics can be an arbitrary neural net while the reward function is a weighted linear combination of features. They evaluate the inference of the dynamics model with real humans on Lunar Lander. Given transcripts of humans playing Lunar Lander, they infer the underlying (incorrect) dynamics model. Then, when the human takes an action, they predict which next state the human wanted to achieve, and replace the human's action with the action that would actually get close to the state the human wanted.\n**My opinion:** I really like that this paper has experiments with real humans. It's definitely a problem that IRL assumes that the expert is (approximately) optimal -- this means that you can't learn where the expert is likely to be wrong, and so it is hard to exceed the expert's performance. 
It's very difficult to figure out how to deal with the possbility of a biased expert, and I'm happy to see work that takes a shot at it.\n \n\nTechnical AI alignment\n \n\nProblems\n[How the Enlightenment Ends](https://www.theatlantic.com/magazine/archive/2018/06/henry-kissinger-ai-could-mean-the-end-of-human-history/559124/) *(Henry A. Kissinger)*: This is an article about the dangers of AI written by a non-technologist, hitting some points that are relatively familiar.\n**My opinion:** While there are many points that I disagree with (eg. \"what [AIs] do uniquely is not thinking as heretofore conceived and experienced. Rather, it is unprecedented memorization and computation\"), overall there was a surprising amount of familiar material said in a different way (such as explainability and unintended consequences).\n \n\nLearning human intent\n**[Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior](https://arxiv.org/abs/1805.08010)** *(Siddharth Reddy et al)*: Summarized in the highlights!\n[A Framework and Method for Online Inverse Reinforcement Learning](http://arxiv.org/abs/1805.07871) *(Saurabh Arora et al)*: This paper introduces Incremental Inverse Reinforcement Learning (I2RL), where the agent continually gets new demonstrations from an expert, and has to update the estimate of the reward function in real time. The running example is a robot that has to navigate to a goal location without being seen by two guards that are patrolling. The robot needs to infer the rewards of the two guards in order to predict what they will do and plan around them. Since the guards are sometimes out of sight, we get demonstrations *with occlusion*, that is, some of the states in the demonstrations are hidden.\nIn the batch setting, this is solved with Latent Maximum Entropy IRL. To deal with occluded states Z, we define a probability distribution Pr(Z | Y, θ), where Y is the visible states and θ is the reward weights. Then, you can use expectation maximization to find θ -- in the expectation step, you compute feature expectations of the demonstrations (taking an expectation over hidden states Z), and in the maximization step, you compute θ using the feature expectations as in standard maximum entropy IRL. The authors show how to extend this algorithm to the incremental setting where you only keep the reward weights, the feature expectations, and the number of past demonstrations as statistics. They show some convergence guarantees and evaluate on their running example of a robot that must evade guards.\n**My opinion:** IRL algorithms are often more computationally expensive than state-of-the-art RL algorithms, so I'm happy to see work that's trying to make it more realistic. That said, this paper focuses on settings where IRL is used to infer other agent's preferences so we can plan around them (as opposed to imitation learning) -- this setting seems not very important for AI alignment. I'm also very confused by the experiments -- it seems in Figure 2 that if you ignore previous optimization and initialize the reward with random weights, it does better. (It isn't ignoring all previous data, because it still has access to past feature expectations.) They don't comment on this in the paper, but my guess is that they ran more iterations of expectation maximization (which is why the learning duration is higher) and that's why they got better performance.\n[Imitating Latent Policies from Observation](https://arxiv.org/abs/1805.07914) *(Ashley D. 
Edwards et al)*\n[Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications](http://arxiv.org/abs/1805.07687) *(Daniel S. Brown et al)*\n[Maximum Causal Tsallis Entropy Imitation Learning](http://arxiv.org/abs/1805.08336) *(Kyungjae Lee et al)*\n[Planning to Give Information in Partially Observed Domains with a Learned Weighted Entropy Model](http://arxiv.org/abs/1805.08263) *(Rohan Chitnis et al)*\n[Safe Policy Learning from Observations](http://arxiv.org/abs/1805.07805) *(Elad Sarafian et al)*\n \n\nHandling groups of agents\n[Learning to Teach in Cooperative Multiagent Reinforcement Learning](http://arxiv.org/abs/1805.07830) *(Shayegan Omidshafiei et al)*\n \n\nInterpretability\n[Unsupervised Learning of Neural Networks to Explain Neural Networks](http://arxiv.org/abs/1805.07468) *(Quanshi Zhang et al)*\n \n\nVerification\n[Verifiable Reinforcement Learning via Policy Extraction](http://arxiv.org/abs/1805.08328) *(Osbert Bastani et al)*: Since it is hard to verify properties of neural nets, we can instead first train a decision tree policy to mimic the policy learned by deep RL, and then verify properties about that. The authors generalize [DAGGER](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf) to take advantage of the Q-function and extract decision tree policies. They then prove a correctness guarantee for a toy version of Pong (where the dynamics are known), a robustness guarantee for Pong (with symbolic states, not pixels) (which can be done without known dynamics), and stability of cartpole.\n**My opinion:** Many people believe that ultimately we will need to prove theorems about the safety of our AIs. I don't understand yet what kind of theorems they have in mind, so I don't really want to speculate on how this relates to it. It does seem like the robustness guarantee is the most relevant one, since in general we won't have access to a perfect model of the dynamics.\n \n\nMiscellaneous (Alignment)\n[When is unaligned AI morally valuable?](https://www.lesswrong.com/posts/3kN79EuT27trGexsq/when-is-unaligned-ai-morally-valuable) *(Paul Christiano)*: When might it be a good idea to hand the keys to the universe to an unaligned AI? This post looks more deeply at this question, which could be important as a backup plan if we don't think we can build an aligned AI. I can't easily summarize this, so you'll have to read the post.\n[A Psychopathological Approach to Safety Engineering in AI and AGI](https://arxiv.org/abs/1805.08915) *(Vahid Behzadan et al)*: Since AGI research aims for cognitive functions that are similar to humans, they will be vulnerable to similar psychological issues. Some problems can be recast in this light -- for example, wireheading can be thought of as delusional or addictive behavior. This framework suggests new solutions to AI safety issues -- for example, analogous to behavioral therapy, we can retrain a malfunctioning agent in controlled environments to remove the negative effects of earlier experiences.\n**My opinion:** The analogy is interesting but I'm not sure what to take away from the paper, and I think there are also big disanalogies. The biggest one is that we have to communicate our goals to an AI, whereas humans come equipped with some goals from birth (though arguably most of our goals come from the environment we grow up in). 
I'd be interested in seeing future work from this agenda, since I don't know how I could do work on the agenda laid out in this paper.\n \n\nAI strategy and policy\n[2018 White House Summit on Artificial Intelligence for American Industry](https://www.whitehouse.gov/wp-content/uploads/2018/05/Summary-Report-of-White-House-AI-Summit.pdf?latest) *(White House OSTP)*: See [Import AI](https://jack-clark.net/2018/05/22/import-ai-95-learning-to-predict-and-avoid-internet-arguments-with-deep-learning-white-house-announces-select-committee-on-ai-and-bmw-trains-cars-to-safely-change-lanes/)\n[France, China, and the EU All Have an AI Strategy. Shouldn’t the US?](https://www.wired.com/story/the-us-needs-an-ai-strategy/) *(John K. Delaney)*: See [Import AI](https://jack-clark.net/2018/05/22/import-ai-95-learning-to-predict-and-avoid-internet-arguments-with-deep-learning-white-house-announces-select-committee-on-ai-and-bmw-trains-cars-to-safely-change-lanes/)\n**Read more:** [FUTURE of AI Act](https://www.cantwell.senate.gov/news/press-releases/cantwell-bipartisan-colleagues-introduce-bill-to-further-understand-and-promote-development-of-artificial-intelligence-drive-economic-opportunity)\n \n\nAI capabilities\n \n\nReinforcement learning\n**[Solving the Rubik's Cube Without Human Knowledge](https://arxiv.org/abs/1805.07470)** *(Stephen McAleer, Forest Agostinelli, Alexander Shmakov et al)*: Summarized in the highlights!\n[Gym Retro, again](https://blog.openai.com/gym-retro/) *(Vicki Pfau et al)*: OpenAI is releasing the full version of Gym Retro, with over a thousand games, and a tool for integrating new games into the framework. And of course we see new games in which RL agents find infinite loops that give them lots of reward -- Cheese Cat-Astrophe and Blades of Vengeance.\n[Feedback-Based Tree Search for Reinforcement Learning](https://arxiv.org/abs/1805.05935) *(Daniel R. 
Jiang et al)*: See [Import AI](https://jack-clark.net/2018/05/22/import-ai-95-learning-to-predict-and-avoid-internet-arguments-with-deep-learning-white-house-announces-select-committee-on-ai-and-bmw-trains-cars-to-safely-change-lanes/)\n[Evolutionary Reinforcement Learning](https://arxiv.org/abs/1805.07917v1) *(Shauharda Khadka et al)*\n[Learning Time-Sensitive Strategies in Space Fortress](https://arxiv.org/abs/1805.06824) *(Akshat Agarwal et al)*\n[Learning Real-World Robot Policies by Dreaming](https://arxiv.org/abs/1805.07813) *(AJ Piergiovanni et al)*\n[Episodic Memory Deep Q-Networks](http://arxiv.org/abs/1805.07603) *(Zichuan Lin et al)*\n \n\nMeta learning\n[Meta-learning with differentiable closed-form solvers](https://arxiv.org/abs/1805.08136) *(Luca Bertinetto et al)*\n[Task-Agnostic Meta-Learning for Few-shot Learning](http://arxiv.org/abs/1805.07722) *(Muhammad Abdullah Jamal et al)*\n \n\nHierarchical RL\n[Hierarchical Reinforcement Learning with Deep Nested Agents](https://arxiv.org/abs/1805.07008) *(Marc Brittain et al)*\n[Hierarchical Reinforcement Learning with Hindsight](https://arxiv.org/abs/1805.08180) *(Andrew Levy et al)*\n[Data-Efficient Hierarchical Reinforcement Learning](http://arxiv.org/abs/1805.08296) *(Ofir Nachum et al)*\n \n\nMiscellaneous (Capabilities)\n[The Blessings of Multiple Causes](https://arxiv.org/abs/1805.06826) *(Yixin Wang et al)* |\n\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| [If you got forwarded this email, you can sign up here!](http://eepurl.com/dqMSZj \"If you got forwarded this email, you can sign up here!\") |\n\n |\n\n |\n| \n\n| | | | | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n| | | | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n| | | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n| | | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| \n\n\n| | | |\n| --- | --- | --- |\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n |\n\n\n\n\n| | | |\n| --- | --- | --- |\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n |\n\n\n\n\n| | | |\n| --- | --- | --- |\n| \n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n |\n\n\n |\n\n |\n\n |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n| |\n| --- |\n| |\n\n |\n\n\n\n| | |\n| --- | --- |\n| \n\n\n| |\n| --- |\n| *Copyright © 2018 Rohin Shah, All rights reserved.*\n\n\n"}
abstract: |
A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In these environments, a reward is always received at the end of the game; however, for many combinatorial optimization environments, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration: a novel reinforcement learning algorithm that is able to teach itself how to solve the Rubik's Cube with no human assistance. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves --- less than or equal to solvers that employ human domain knowledge.
author:
- |
Stephen McAleer[^1]
Department of Statistics
University of California, Irvine
smcaleer@uci.edu
Forest Agostinelli
Department of Computer Science
University of California, Irvine
fagostin@uci.edu
Alexander Shmakov
Department of Computer Science
University of California, Irvine
ashmakov@uci.edu
Pierre Baldi
Department of Computer Science
University of California, Irvine
pfbaldi@ics.uci.edu
bibliography:
- mybib.bib
title: Solving the Rubik's Cube Without Human Knowledge
Introduction
Reinforcement learning seeks to create intelligent agents that adapt to an environment by analyzing their own experiences. Reinforcement learning agents have achieved superhuman capability in a number of competitive games [@tesauro1995temporal; @mnih2015human; @silver2017mastering1; @silver2017mastering2]. This recent success of reinforcement learning has been a product of the combination of classic reinforcement learning [@bellman1957dynamic; @sutton_1988; @watkins1992q; @sutton1998reinforcement], deep learning [@lecun2015deep; @goodfellow2016deep], and Monte Carlo Tree Search (MCTS) [@coulom2006efficient; @kocsis2006bandit; @herik_2007]. The fusion of reinforcement learning and deep learning is known as deep reinforcement learning (DRL). Though DRL has been successful in many different domains, it relies heavily on the condition that an informative reward can be obtained from an initially random policy. Current DRL algorithms struggle in environments with a high number of states and a small number of reward states, such as Sokoban and Montezuma's Revenge. Other environments, such as short answer exam problems, prime number factorization, and combination puzzles like the Rubik's Cube, have a large state space and only a single reward state.
The 3x3x3 Rubik's cube is a classic 3-dimensional combination puzzle. There are 6 faces, or 3x3x1 planes, which can be rotated $90^{\circ}$ in either direction. The goal state is reached when all stickers on each face of the cube are the same color, as shown in Figures 2 and 3. The Rubik's cube has a large state space, with approximately $4.3 \times 10^{19}$ different possible configurations. However, out of all of these configurations, only one state has a reward signal: the goal state. Therefore, starting from random states and applying DRL algorithms, such as asynchronous advantage actor-critic (A3C) [@mnih2016asynchronous], could theoretically result in the agent never solving the cube and never receiving a learning signal.
To overcome a sparse reward in a model-based environment with a large state space, we introduce Autodidactic Iteration (ADI): an algorithm inspired by policy iteration [@bellman1957dynamic; @sutton1998reinforcement] for training a joint value and policy network. ADI trains the value function through an iterative supervised learning process. In each iteration, the inputs to the neural network are created by starting from the goal state and randomly taking actions. The targets seek to estimate the optimal value function by performing a breadth-first search from each input state and using the current network to estimate the value of each of the leaves in the tree. Updated value estimates for the root nodes are obtained by recursively backing up the values for each node using a max operator. The policy network is similarly trained by constructing targets from the move that maximizes the value. After the network is trained, it is combined with MCTS to efficiently solve the Rubik's Cube. We call the resulting solver DeepCube.
[Figure 1: fig:overview]
Related work
Erno Rubik created the Rubik's Cube in 1974. After a month of effort, he came up with the first algorithm to solve the cube. Since then, the Rubik's Cube has gained worldwide popularity and many human-oriented algorithms for solving it have been discovered [@ruwix]. These algorithms are simple to memorize and teach humans how to solve the cube in a structured, step-by-step manner.
Human-oriented algorithms to solve the Rubik's Cube, while easy to memorize, find long suboptimal solutions. Since 1981, there has been theoretical work on finding the upper bound for the number of moves necessary to solve any valid cube configuration [@thistlethwaite_1981; @reid_1995; @radu_2007; @kunkle_2007; @rokicki2010twenty]. Finally, in 2014, it was shown that any valid cube can be optimally solved with at most 26 moves in the quarter-turn metric, or 20 moves in the half-turn metric [@rokicki2014diameter; @rokicki_2014]. The quarter-turn metric treats 180 degree rotations as two moves, whereas the half-turn metric treats 180 degree rotations as one move. For the remainder of this paper we will be using the quarter-turn metric. This upper bound on the number of moves required to solve the Rubik's cube is colloquially known as God's Number.
Algorithms used by machines to solve the Rubik's Cube rely on hand-engineered features and group theory to systematically find solutions. One popular solver for the Rubik's Cube is the Kociemba two-stage solver [@kociemba_2018]. This algorithm uses the Rubik's Cube's group properties to first maneuver the cube to a smaller sub-group, after which finding the solution is trivial. Heuristic based search algorithms have also been employed to find optimal solutions. Korf first used a variation of the A* heuristic search, along with a pattern database heuristic, to find the shortest possible solutions [@korf_1997]. More recently, there has been an attempt to train a DNN -- using supervised learning with hand-engineered features -- to act as an alternative heuristic [@Brunetto_2017]. These search algorithms, however, take an extraordinarily long time to run and usually fail to find a solution to randomly scrambled cubes within reasonable time constraints. Besides hand-crafted algorithms, attempts have been made to solve the Rubik's Cube through evolutionary algorithms [@smith2016discovering; @lichodzijewski2011rubik]. However, these learned solvers can only reliably solve cubes that are up to 5 moves away from the solution.
We solve the Rubik's Cube using pure reinforcement learning without human knowledge. In order to solve the Rubik's Cube using reinforcement learning, the algorithm will learn a policy. The policy determines which move to take in any given state. Much work has already been done on finding optimal policy functions in classic reinforcement learning [@sutton_1988; @watkins1992q; @williams1992simple; @moriarty1999evolutionary]. Autodidactic Iteration can be seen as a type of policy iteration algorithm [@sutton1998reinforcement]. Since our implementation of ADI uses a depth-1 greedy policy when training the network, it can also be thought of as value iteration. Policy iteration is an iterative reinforcement learning algorithm which alternates between a policy evaluation step and a policy improvement step. The policy evaluation step estimates the state-value function given the current policy, while the policy improvement step uses the estimated state-value function to improve the policy. Though many policy iteration algorithms require multiple sweeps through the state space for each step, value iteration effectively combines policy evaluation and policy improvement into a single sweep. Other works have used neural networks to augment MCTS [@guo_2014; @silver_2016; @anthony_2017]. Our approach is most similar to AlphaGo Zero [@silver_2017] and Expert Iteration [@anthony_2017], which also do not use human knowledge.
The Rubik's Cube
[Figures 2-3: fig:solvedTop, fig:solvedBottom (the solved cube)]
[Figures 4-5: fig:solvedTop_compr, fig:solvedBottom_compr (the reduced state representation)]
The Rubik's Cube consists of 26 smaller cubes called cubelets. These are classified by their sticker count: center, edge, and corner cubelets have 1, 2, and 3 stickers attached respectively. There are 54 stickers in total with each sticker uniquely identifiable based on the type of cubelet the sticker is on and the other sticker(s) on the cubelet. Therefore, we may use a one-hot encoding for each of the 54 stickers to represent their location on the cube.
However, because the position of one sticker on a cubelet determines the position of the remaining stickers on that cubelet, we may actually reduce the dimensionality of our representation by focusing on the position of only one sticker per cubelet. We ignore the redundant center cubelets and only store the 24 possible locations for the edge and corner cubelets. This results in a 20x24 state representation which is demonstrated in Figures 4 and 5.
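To make the 20x24 representation concrete, the sketch below builds a one-hot state array from a hypothetical mapping of the 20 movable cubelets to the 24 possible sticker locations. The `cubelet_locations` indexing scheme is an illustrative assumption, not the authors' exact encoding; any fixed, consistent labeling would serve the same purpose.

```python
import numpy as np

def encode_state(cubelet_locations):
    """One-hot encode a cube state as a 20x24 array, then flatten it.

    cubelet_locations: a length-20 sequence giving, for each of the 8 corner
    and 12 edge cubelets, the index (0-23) of the location currently occupied
    by that cubelet's tracked sticker. The indexing scheme here is an
    illustrative assumption.
    """
    state = np.zeros((20, 24), dtype=np.float32)
    for cubelet, location in enumerate(cubelet_locations):
        state[cubelet, location] = 1.0
    return state.flatten()  # 480-dimensional input for the network

# Example with an (assumed) identity labeling of the solved cube,
# where the tracked sticker of cubelet i sits at location i:
solved_encoding = encode_state(range(20))
print(solved_encoding.shape)  # (480,)
```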
Moves are represented using face notation originally developed by David Singmaster: a move is a letter stating which face to rotate. $\textit{F}$, $\textit{B}$, $\textit{L}$, $\textit{R}$, $\textit{U}$, and $\textit{D}$ correspond to turning the front, back, left, right, up, and down faces, respectively. A clockwise rotation is represented with a single letter, whereas a letter followed by an apostrophe represents a counter-clockwise rotation. For example: $\textit{R}$ and $\textit{R'}$ would mean to rotate the right face $90^{\circ}$ clockwise and counter-clockwise, respectively.
Formally, the Rubik's Cube environment consists of a set of $4.3 \times 10^{19}$ states $\mathcal{S}$ which contains one special state, $s_{solved}$, representing the goal state. At each timestep, $t$, the agent observes a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$ with $\mathcal{A} := \left\lbrace \textit{F, F', \ldots , D, D'} \right\rbrace$. After selecting an action, the agent observes a new state $s_{t+1} = A(s_t, a_t)$ and receives a scalar reward, $R(s_{t+1})$, which is 1 if $s_{t+1}$ is the goal state and $-1$ otherwise.
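A minimal environment interface consistent with this formalization might look as follows. Here `apply_move` and `is_solved` are placeholders for the actual cube dynamics $A(s_t, a_t)$ and goal test, which are not reproduced; only the action set and the reward definition come directly from the text above.

```python
# Sketch of the Rubik's Cube environment interface described above.
# apply_move and is_solved stand in for the real cube dynamics (not shown).

ACTIONS = ["F", "F'", "B", "B'", "L", "L'", "R", "R'", "U", "U'", "D", "D'"]

def is_solved(state):
    """Cube-specific goal test (placeholder)."""
    raise NotImplementedError

def apply_move(state, action):
    """Cube dynamics A(s, a) (placeholder)."""
    raise NotImplementedError

def reward(state):
    """R(s): 1 at the goal state, -1 otherwise."""
    return 1.0 if is_solved(state) else -1.0

def step(state, action):
    """One environment transition: returns s_{t+1} = A(s_t, a_t) and R(s_{t+1})."""
    next_state = apply_move(state, action)
    return next_state, reward(next_state)
```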
Methods
[Figure 6: fig:training]
We develop a novel algorithm called Autodidactic Iteration which is used to train a joint value and policy network. Once the network is trained, it is combined with MCTS to solve randomly scrambled cubes. The resulting solver is called DeepCube.
Autodidactic Iteration
ADI is an iterative supervised learning procedure that trains a deep neural network $f_\theta(s)$ with parameters $\theta$, which takes an input state $s$ and outputs a value and policy pair $(v, \boldsymbol{p})$. The policy output $\boldsymbol{p}$ is a vector containing the move probabilities for each of the $12$ possible moves from that state. Once the network is trained, the policy is used to reduce breadth and the value is used to reduce depth in the MCTS. In each iteration of ADI, training samples for $f_\theta$ are generated by starting from the solved cube. This ensures that some training inputs will be close enough to the solution to have a positive reward when performing a shallow search. Targets are then created by performing a depth-1 breadth-first search (BFS) from each training sample. The current value network is used to estimate each child's value. The value target for each sample is the maximum value and reward of each of its children, and the policy target is the action which led to this maximal value. Figure 6 displays a visual overview of ADI.
Formally, we generate training samples by starting with $s_{solved}$ and scrambling the cube $k$ times to generate a sequence of $k$ cubes. We do this $l$ times to generate $N = k*l$ training samples $X = [x_i]_{i=1}^N$. Each sample in the series has the number of scrambles it took to generate it, $D(x_i)$, associated with it. Then, for each sample $x_i \in X$, we generate a training target $Y_i = (y_{v_i}, \boldsymbol{y}_{p_i})$. To do this, we perform a depth-1 BFS to get the set of all children states of $x_i$. We then evaluate the current neural network at each of the children states to receive estimates of each child's optimal value and policy: $\forall a \in \mathcal{A},\ (v_{x_i}(a), \boldsymbol{p}_{x_i}(a)) = f_\theta(A(x_i, a))$. We set the value target to be the maximal value from each of the children, $y_{v_i} \gets \max_a(R(A(x_i, a)) + v_{x_i}(a))$, and we set the policy target to be the move that results in the maximal estimated value, $\boldsymbol{y}_{p_i} \gets \operatorname{argmax}_a (R(A(x_i, a)) + v_{x_i}(a))$. We then train $f_\theta$ on these training samples and targets $[x_i, y_{v_i}]_{i=1}^N$ and $[x_i, \boldsymbol{y}_{p_i}]_{i=1}^N$ to obtain new neural network parameters $\theta'$. For training, we used the RMSProp optimizer [@hinton_2014] with a mean squared error loss for the value and a softmax cross entropy loss for the policy. Although we use a depth-1 BFS for training, this process may be trivially generalized to perform deeper searches at each $x_i$.
Initialization: $\theta$ is initialized using Glorot initialization.
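The target-generation step just described could be sketched as follows. This is a sketch, not the authors' code: `net(state) -> (value, policy)` plays the role of the current $f_\theta$, and `SOLVED_STATE`, `random_move`, `ACTIONS`, `apply_move`, and `reward` are assumed cube helpers (as in the earlier environment sketch). The target construction mirrors the depth-1 BFS equations above.

```python
import numpy as np

def adi_generate_targets(net, k, l):
    """One round of ADI target generation (sketch).

    Returns inputs x_i, value targets y_v, policy targets y_p (as action
    indices), and the scramble depth D(x_i) used later for loss weighting.
    """
    xs, y_v, y_p, depths = [], [], [], []
    for _ in range(l):                          # l independent scramble sequences
        state = SOLVED_STATE
        for depth in range(1, k + 1):           # k successive random moves
            state = apply_move(state, random_move())
            # Depth-1 BFS: evaluate every child with the current network.
            child_values = []
            for action in ACTIONS:
                child = apply_move(state, action)
                v_child, _ = net(child)
                child_values.append(reward(child) + v_child)
            xs.append(state)
            y_v.append(max(child_values))                 # value target
            y_p.append(int(np.argmax(child_values)))      # policy target
            depths.append(depth)                          # D(x_i)
    return xs, y_v, y_p, depths
```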
Weighted samples
During testing, we found that the learning algorithm sometimes either converged to a degenerate solution or diverged completely. To counteract this, we assigned a higher training weight to samples that are closer to the solved cube than to samples that are farther away. We assign each sample $x_i$ a loss weight $W(x_i) = \frac{1}{D(x_i)}$. We did not observe divergent behavior after this addition.
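In practice this weighting amounts to multiplying each sample's loss by $1/D(x_i)$ before averaging. A minimal sketch for the value loss is below; normalizing by the sum of the weights is one reasonable choice, since the paper does not specify the normalization.

```python
import numpy as np

def weighted_value_loss(predicted, target, depths):
    """Mean squared error with per-sample weight W(x_i) = 1 / D(x_i)."""
    w = 1.0 / np.asarray(depths, dtype=np.float32)
    sq_err = (np.asarray(predicted, dtype=np.float32)
              - np.asarray(target, dtype=np.float32)) ** 2
    return float(np.sum(w * sq_err) / np.sum(w))  # weighted average (assumed normalization)
```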
Solver
We employ an asynchronous Monte Carlo Tree Search augmented with our trained neural network $f_\theta$ to solve the cube from a given starting state $s_0$. We build a search tree iteratively by beginning with a tree consisting only of our starting state, $T = \left\lbrace s_0 \right\rbrace$. We then perform simulated traversals until reaching a leaf node of $T$. Each state, $s \in T$, has a memory attached to it storing: $N_s(a)$, the number of times an action $a$ has been taken from state $s$, $W_s(a)$, the maximal value of action $a$ from state $s$, $L_s(a)$, the current virtual loss for action $a$ from state $s$, and $P_s(a)$, the prior probability of action $a$ from state $s$.
Every simulation starts from the root node and iteratively selects actions by following a tree policy until an unexpanded leaf node, $s_\tau$, is reached. The tree policy proceeds as follows: for each timestep $t$, an action is selected by choosing $A_t = \operatorname{argmax}_a U_{s_t}(a) + Q_{s_t}(a)$, where $U_{s_t}(a) = c P_{s_t}(a) \sqrt{\sum_{a'} N_{s_t}(a')} / (1 + N_{s_t}(a))$ and $Q_{s_t}(a) = W_{s_t}(a) - L_{s_t}(a)$, with an exploration hyperparameter $c$. The virtual loss is also updated, $L_{s_t}(A_t) \gets L_{s_t}(A_t) + \nu$, using a virtual loss hyperparameter $\nu$. The virtual loss prevents the tree search from visiting the same state more than once and discourages the asynchronous workers from all following the same path [@segal_2011].
Once a leaf node, $s_\tau$, is reached, the state is expanded by adding the children of $s_\tau$, $\{ A(s_\tau, a), \forall a \in \mathcal{A} \}$, to the tree $T$. Then, for each child $s'$, the memories are initialized: $W_{s'}(\cdot) = 0$, $N_{s'}(\cdot) = 0$, $P_{s'}(\cdot) = \boldsymbol{p}_{s'}$, and $L_{s'}(\cdot) = 0$, where $\boldsymbol{p}_{s'}$ is the policy output of the network $f_\theta(s')$. Next, the value and policy are computed, $(v_{s_\tau}, \boldsymbol{p}_{s_\tau}) = f_\theta(s_\tau)$, and the value is backed up on all visited states in the simulated path. For $0 \leq t \leq \tau$, the memories are updated: $W_{s_t}(A_t) \gets \max(W_{s_t}(A_t), v_{s_\tau})$, $N_{s_t}(A_t) \gets N_{s_t}(A_t) + 1$, $L_{s_t}(A_t) \gets L_{s_t}(A_t) - \nu$. Note that, unlike other implementations of MCTS, only the maximal value encountered along the tree is stored, and not the total value. This is because the Rubik's Cube is deterministic and not adversarial, so we do not need to average our reward when deciding a move.
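A compact sketch of the selection and backup steps just described is given below, written single-threaded but keeping the virtual-loss bookkeeping so the extension to asynchronous workers is visible. The class and function names are illustrative; `c` and `nu` are the exploration and virtual-loss hyperparameters from the text, and the prior vector comes from the policy head of $f_\theta$.

```python
import math

class Node:
    """Per-state memories N_s, W_s, L_s, P_s, indexed by action."""
    def __init__(self, prior):
        n = len(prior)
        self.N = [0.0] * n    # visit counts
        self.W = [0.0] * n    # maximal value seen through each action
        self.L = [0.0] * n    # virtual losses
        self.P = list(prior)  # prior move probabilities from the policy head

def select_action(node, c, nu):
    """Tree policy: A_t = argmax_a U(a) + Q(a), then apply virtual loss."""
    total_visits = sum(node.N)
    best_a, best_score = 0, -float("inf")
    for a in range(len(node.P)):
        u = c * node.P[a] * math.sqrt(total_visits) / (1.0 + node.N[a])
        q = node.W[a] - node.L[a]
        if u + q > best_score:
            best_a, best_score = a, u + q
    node.L[best_a] += nu
    return best_a

def backup(path, leaf_value, nu):
    """Back up the leaf value along the visited (node, action) path."""
    for node, a in path:
        node.W[a] = max(node.W[a], leaf_value)  # keep the maximum, not a running sum
        node.N[a] += 1
        node.L[a] -= nu
```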
The simulation is performed until either $s_\tau$ is the solved state or the simulation exceeds a fixed maximum computation time. If $s_\tau$ is the solved state, then the tree $T$ of the simulation is extracted and converted into an undirected graph with unit weights. A full breadth-first search is then applied to $T$ to find the shortest predicted path from the starting state to the solution. Alternatively, the last sequence $Path = \left\lbrace A_t \mid 0 \leq t \leq \tau \right\rbrace$ may be used, but this produces longer solutions.
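The final extraction step is a standard breadth-first search over the explored tree viewed as an unweighted, undirected graph. A sketch, assuming states are hashable and `edges` maps each explored state to its explored neighbors:

```python
from collections import deque

def shortest_path(edges, start, goal):
    """BFS over the explored search tree, treated as a unit-weight undirected graph."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            # Reconstruct the path from start to goal via parent pointers.
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return list(reversed(path))
        for neighbor in edges.get(state, ()):
            if neighbor not in parents:
                parents[neighbor] = state
                queue.append(neighbor)
    return None  # goal not present in the explored tree
```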
Results
We used a feed-forward network as the architecture for $f_\theta$, as shown in Figure [fig:network]. The outputs of the network are a scalar $v$, representing the value, and a 12-dimensional vector $\boldsymbol{p}$, representing the probability of selecting each of the possible moves. The network was then trained using ADI for 2,000,000 iterations. The network witnessed approximately 8 billion cubes, including repeats, and it trained for a period of 44 hours. Our training machine was a 32-core Intel Xeon E5-2620 server with three NVIDIA Titan XP GPUs.
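A joint value-and-policy network of this shape could be sketched as below. The specific layer widths (4096, 2048, 512) and the ELU activations are assumptions for illustration, since the architecture figure is not reproduced here; only the 20x24 input and the scalar value plus 12-way policy outputs come from the text.

```python
import torch
import torch.nn as nn

class ValuePolicyNet(nn.Module):
    """Sketch of the joint value/policy network f_theta (layer sizes assumed)."""
    def __init__(self, input_dim=20 * 24, n_actions=12):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(input_dim, 4096), nn.ELU(),
            nn.Linear(4096, 2048), nn.ELU(),
        )
        self.value_head = nn.Sequential(nn.Linear(2048, 512), nn.ELU(), nn.Linear(512, 1))
        self.policy_head = nn.Sequential(nn.Linear(2048, 512), nn.ELU(), nn.Linear(512, n_actions))

    def forward(self, x):
        h = self.body(x)
        # Returns the scalar value and the policy logits; a softmax over the
        # logits gives the move probabilities p.
        return self.value_head(h), self.policy_head(h)
```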
[Figure 7: fig:perf_time]
[Figure: fig:solve_time]
As a baseline, we compare DeepCube against two other solvers. The first baseline is the Kociemba two-stage solver [@kociemba_2018; @tsoy_2018]. This algorithm relies on human domain knowledge of the group theory of the Rubik's Cube. Kociemba will always solve any cube given to it, and it runs very quickly. However, because of its general-purpose nature, it often finds a longer solution compared to the other solvers. The other baseline is the Korf Iterative Deepening A* (IDA*) with a pattern database heuristic [@korf_1997; @brown_2017]. Korf's algorithm will always find the optimal solution from any given starting state; but, since it is a heuristic tree search, it will often have to explore many different states and it will take a long time to compute a solution. We also compare the full DeepCube solver against two variants of itself. First, we do not calculate the shortest path of our search tree and instead extract the initial path from the MCTS: this variant is named Naive DeepCube. We also use our trained value network as a heuristic in a greedy best-first search for a simple evaluation of the value network: this variant is named Greedy. An overview of each solver's capacity to solve cubes is presented in Figure 7.
We compare our results to Kociemba using $640$ randomly scrambled cubes. Starting from the solved cube, each cube was randomly scrambled $1000$ times. Both DeepCube and Kociemba solved all $640$ cubes within one hour. Kociemba solved each cube in under a second, while DeepCube had a median solve time of $10$ minutes. The systematic approach of Kociemba explains its low spread of solution lengths, with an interquartile range of only $3$. Although DeepCube has a much higher variance in solution length, it was able to match or beat Kociemba in 55% of cases. We also compare DeepCube against Naive DeepCube to determine the benefit of performing the BFS on our MCTS tree. We find that the BFS has a slight, but consistent, performance gain over the MCTS path (a median solution length shorter by $3$ moves). This is primarily because the BFS is able to remove all cycles from the solution path. A comparison of solution length distributions for these three solvers is presented in the left graph of Figure 9.
We could not include Korf in the previous comparison because its runtime is prohibitively slow: solving just one of the $640$ cubes took over $6$ days. We instead evaluate the optimality of solutions found by DeepCube by comparing it to Korf on cubes closer to the solution. We generated a new set of $100$ cubes that were only scrambled $15$ times. At this distance, all solvers could reliably solve all $100$ cubes within an hour. We compare the length of the solutions found by the different solvers in the right graph of Figure 9. Noticeably, DeepCube performs much more consistently on close cubes than Kociemba, and it almost matches the performance of Korf. The median solve length for both DeepCube and Korf is $13$ moves, and DeepCube matches the optimal path found by Korf in 74% of cases. However, DeepCube seems to have trouble with a small selection of cubes, which results in several solutions being longer than 15 moves. Note that Korf has one outlier above 15 moves. This is because Korf is based on the half-turn metric while we are using the quarter-turn metric.
Furthermore, our network also explores far fewer tree nodes when compared to heuristic-based searches. The Korf optimal solver requires an average expansion of 122 billion different nodes for fully scrambled cubes before finding a solution [@korf_1997]. Our MCTS algorithm expands an average of only 1,136 nodes with a maximum of 3,425 expanded nodes on the longest running sample. This is why DeepCube is able to solve fully scrambled cubes much faster than Korf.
[Figure 9: fig:benchmark]
Knowledge Learned
DeepCube discovered a notable amount of Rubik's Cube knowledge during its training process, including how to use complex permutation groups and strategies similar to those of the best human "speed-cubers". For example, DeepCube heavily uses one particular pattern that commonly appears when examining normal subgroups of the cube: $aba^{-1}$. That is, a sequence of moves that performs some action $a$, then a different action $b$, and then reverses the first action with $a^{-1}$. An intelligent agent should use these conjugations often because they are necessary for manipulating specific cubelets while not affecting the positions of the other cubelets.
We examine all of the solution paths that DeepCube generated for the 640 fully scrambled cubes by moving a sliding window across the solution strings to gather all triplets. We then compute the frequency of each triplet and separate them into two categories: those matching the conjugation pattern $aba^{-1}$ and those not matching it. We find that the top 14 most used triplets were, in fact, $aba^{-1}$ conjugations. We also compare the distribution of frequencies for the two types of triplets. In Figure 10, we plot the distribution of frequencies for each of the categories. We notice that conjugations appear consistently more often than the other types of triplets.
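The triplet analysis can be reproduced with a simple sliding window over the solution move lists. A sketch is below, where a triplet $(a, b, c)$ counts as a conjugation when $c$ inverts $a$; the function names are illustrative and the toy input is not from the paper's data.

```python
from collections import Counter

def inverse(move):
    """Inverse of a quarter-turn face move: R <-> R', U <-> U', and so on."""
    return move[:-1] if move.endswith("'") else move + "'"

def triplet_counts(solutions):
    """Count move triplets across solutions, split by the a b a^-1 pattern."""
    conjugations, others = Counter(), Counter()
    for moves in solutions:                      # each solution is a list of moves
        for i in range(len(moves) - 2):
            a, b, c = moves[i], moves[i + 1], moves[i + 2]
            bucket = conjugations if c == inverse(a) else others
            bucket[(a, b, c)] += 1
    return conjugations, others

# Toy usage example:
conj, other = triplet_counts([["R", "U", "R'", "D", "F", "F"]])
print(conj.most_common(3), other.most_common(3))
```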
We also examine the strategies that DeepCube learned. Often, the solver first prioritizes completing a 2x2x2 corner of the cube. This occurs approximately at the halfway point of the solution. Then, it uses these conjugations to match adjacent edge and corner cubelets in the correct orientation, and it returns to either the same 2x2x2 corner or to an adjacent one. Once each pair of corner-edge pieces is complete, the solver then places them into their final positions and completes the cube. An example of this strategy is presented in Figure [fig:strategy]. This mirrors a strategy that advanced human "speed-cubers" employ when solving the cube, where they prioritize matching together corner and edge cubelets before placing them in their correct locations.
[Figure 10: fig:distribution]
Discussion
We are currently further improving DeepCube by extending it to harder cubes. Autodidactic Iteration can be used to train a network to solve a 4x4x4 cube and other puzzles such as n-dimensional sequential move puzzles and combination puzzles involving other polyhedra.
Besides further work with the Rubik's Cube, we are working on extending this method to find approximate solutions to other combinatorial optimization problems such as prediction of protein tertiary structure. Many combinatorial optimization problems can be thought of as sequential decision making problems, in which case we can use reinforcement learning. Bello et al. train an RNN through policy gradients to solve simple traveling salesman and knapsack problems [@bello]. We believe that harnessing search will lead to better reinforcement learning approaches for combinatorial optimization. For example, in protein folding, we can think of sequentially placing each amino acid in a 3D lattice at each timestep. If we have a model of the environment, ADI can be used to train a value function which looks at a partially completed state and predicts the future reward when finished. This value function can then be combined with MCTS to find approximately optimal conformations.
Léon Bottou defines reasoning as "algebraically manipulating previously acquired knowledge in order to answer a new question"[@bottou_2011]. Many machine learning algorithms do not reason about problems but instead use pattern recognition to perform tasks that are intuitive to humans, such as object recognition. By combining neural networks with symbolic AI, we are able to create algorithms which are able to distill complex environments into knowledge and then reason about that knowledge to solve a problem. DeepCube is able to teach itself how to reason in order to solve a complex environment with only one reward state using pure reinforcement learning.
In summary, combining MCTS with neural networks is a powerful technique that helps to bridge the gap between symbolic AI and connectionism. Furthermore, it has the potential to provide approximate solutions to a broad class of combinatorial optimization problems.
Acknowledgments
We thank Harm van Seijen and Yutian Chen for helpful discussions.
[^1]: Equal contribution