Solving Long Term Planning Problems with Direct Future Prediction

(1)

Solving Long Term Planning Problems with Direct Future Prediction

Sindre Thoresen Kai Olav Ellefsen

Jim Torresen

Thesis submitted for the degree of

Master in Informatikk: Robotikk og intelligente systemer 60 credits

Institutt for informatikk

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

(2)

(3)

Solving Long Term Planning Problems with Direct Future

Prediction

Sindre Thoresen Kai Olav Ellefsen

Jim Torresen

(4)

Solving Long Term Planning Problems with Direct Future Prediction http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(5)

Abstract

Reinforcement learning is a rapidly growing field in machine learning that has the potential to foster algorithms that can solve a wide variety of problems. Two major classes of reinforcement learning are model-free and model-based methods. Model-based methods try to learn predictive models of the environment to achieve better efficiency and relia- bility. However, learning good models is difficult, as minor inaccuracies compound over time and make long-term planning difficult. Because of this and other practical problems, model-based methods have not seen the same success as model-free methods.

Direct Future Prediction (DFP) is a reinforcement learning algorithm that won a Viz- Doom AI competition in 2016 and received significant attention in the reinforcement learning community. Interestingly, it falls somewhere in between model-based and model- free methods. This is because DFP does not try to learn detailed models that simulate the environment. Instead, it only tries to predict the agent’s effect on a few high-level features. The idea is that this can avoid the problem of compounding errors by focusing on predicting just what is important.

DFP has so far only been tested on a first-person shooter environment that is reactive and offers few long-term planning challenges. This thesis focuses on testing DFP in other environments that explicitly test long term-planning and exploration. These are challenges found in many reinforcement learning problems, and documenting DFP’s ability to solve them could give insight into the role of predictive modeling in reinforcement learning. The thesis tests DFP in two new environments and examines its performance.

Results show that DFP can fail to solve certain long-term planning problems, but also that it can be surprisingly effective with the right configuration.

The thesis also proposes a new method that combines DFP with a recent technique in intrinsic motivation, a field in reinforcement learning that focuses on effective exploration.

The new method uses the variance over an ensemble of DFP models as an uncertainty signal to guide exploration. The method is tested on hard exploration and multi-task problems. Results are promising on some problems, where the new method was able to find several solutions, while DFP found none. However, problems with the ensemble

(6)

approach are also discussed, which may hinder scaling to more complex environments.

(7)

Acknowledgements

I would like to thank my supervisor Kai Olav Ellefsen, and my peers Anja and Vegard, who formed our supervisory group. In a year characterized by COVID-19, our weekly discussions, guidance, and feedback motivated me to finish the thesis when the campus was closed. I also want to thank Jim Torresen for his supporting words at the start of the term.

I would also like to thank my co-students and friends that I have met during my years at the Department of Informatics. The collaboration, friendship, and fun we have had along the way gave me the support I needed throughout my studies and made the journey worth it.

Finally, I would like to thank my family for supporting me throughout my studies. You made it all possible.

(8)

Introduction

1.1 Motivation

Reinforcement learning is a branch of Machine Learning focused on problems where the algorithm gets a reward signal according to how well it is doing but no instruction on how to improve. This makes reinforcement learning a general framework that can be applied to any step-wise decision-making process with a reward signal. RL can indeed be applied to interesting problems such as robotic control (Kober et al., 2013) and optimizing warehouse cooling (Li et al., 2020).

Broadly, reinforcement learning methods can be classified as model-free or model-based methods. Model-free methods attempt to solve problems by learning a policy directly.

Given a challenge, the method will suggest a solution directly based on what it has learned from previous experience. Model-based methods, on the other hand, attempt to learn the challenge as well. They can attempt to simulate a challenge, trying different strategies and measuring the effectiveness of each one before suggesting a solution. Given this high- level description of the difference between model-free and model-based methods, model- based methods may seem superior, but these methods are hard to implement effectively in practice. It can be hard to learn a problem well enough to simulate it, and even a small error may throw the method of. Model-free methods are simpler, and advances in deep learning have shown that they can solve hard problems (Kapturowski et al., 2019), (Hessel et al., 2017).

Direct Future Prediction (Dosovitskiy and Koltun, 2016) is an interesting method that falls somewhere in between the classes of model-based and model-free methods. It showed merit when it won a track of the Visual Doom AI competition in 2016. DFP does not try to learn challenges in such detail that it can simulate them or even play them out recursively at all. Instead, it only tries to predict a few high-level features of the future

(11)

of the problem, called measurements. The idea is that this can avoid the problem of compounding errors by focusing just on what is important.

Direct Future Prediction and the underlying strategy of using a set of key measurements to guide prediction efficiently can be applied in a wide range of problems, like in robotic control or self-driving cars, by using measurements such as distance to objects, velocity, and acceleration. However, in the Learning To Act By Predicting The Future paper (Dosovitskiy and Koltun, 2016), DFP is only tested in the VizDoom first-person shooter environment. The VizDoom environment is reactive, meaning that it can be solved by quickly reacting to visual objects and monsters currently in view. Therefore, it is of interest to test DFP in other environments that impose challenges in exploration and long-term planning to study the potential for prediction-based methods in reinforcement learning.

The thesis also presents and investigates a novel augmentation to Direct Future Prediction that incorporates intrinsic motivation in DFP. The approach is a form of active learning, and the hope is that the model will be more sample-efficient and explore better in multi- task learning.

1.2 Research questions

The thesis aims to study how well DFP can solve general reinforcement learning problems, with challenges such as sparse rewards, multiple objectives, and challenges where a long planning horizon is required. The efforts in the thesis can be distilled into two research questions:

How does Direct Future Prediction perform in other environments? To my knowledge, DFP has only ever been tested on the Vizdoom environment. DFP performed particularly well in the environment, winning a Visual Doom AI competition track in 2016. VizDoom is a first-person shooter environment that can be solved reactively, just responding to the objects and enemies currently in view. This thesis studies DFP’s performance in other environments that test other abilities such as exploration and long- term planning. By testing DFP on other environments and in different settings, insights can be gained that not only uncovers the strengths and limitations of DFP but also of prediction-based reinforcement learning in general.

Can intrinsic motivation be incorporated into Direct Future Prediction to improve exploration and multi-task learning with an active learning approach?

An advantage of DFP is that the goal (objective function) can be changed dynamically online without requiring retraining. This ability inspired a novel augmentation of DFP, which was developed as part of this thesis. The new method, Ensemble-DFP, combines

(12)

DFP with a recent method from intrinsic motivation that uses the variation of ensemble predictions as a learning signal. The thesis tests Ensemble-DFPs performance on hard- exploration and multi-task problems and discusses DFP’s ability to use active learning in general.

1.3 Thesis structure

The thesis is structured into 6 chapters. Chapter 2 covers the background that is needed to understand the thesis. It introduces the reinforcement learning framework, presents some recent model-free and model-based methods, covers Direct Future Prediction (Dosovitskiy and Koltun, 2016), and presents the intrinsic motivation method that is used in Ensemble- DFP.

Chapter 3 introduces the environments used to test DFP in the thesis, covering why they are chosen and how they are made compatible with DFP. Policy visualization specific to each environment is also described. The chapter also introduces Ensemble-DFP, the method developed in this thesis.

Chapter 4 covers implementation details, and chapter 5 presents the experiments and the results.

Chapter 6 discusses the results in relation to the research questions, suggests future work and concludes the thesis.

(13)

Chapter 2

Background

This chapter provides the necessary fundamentals to understand reinforcement learning, and other subjects relevant to the thesis. It first describes reinforcement learning in the context of machine learning. Then, it describes the Markov Decision Process, a useful mathematical framework to understand RL environments. It also covers policy optimization frameworks, on/off-policy RL and the differences in model-based/free RL.

Deep RL and some recent model-based deep RL methods are presented, including Direct Future Prediction. Finally, intrinsic motivation is introduced.

2.1 Reinforcement learning

Reinforcement learning (RL) is often recognized as one of the tree main branches of Machine Learning, the others being Supervised learning and Unsupervised learning. Su- pervised learning is the situation where a program have access to labels: right answers.

Perhaps the most notable example of supervised learning is image classification, where the goal is to make a program classify images into different categories: for example, dog- and cat images. Labels help guide the learning process by telling where the program is wrong, information that can be used to learn how to optimize the neural network parameters using back propagation.

In unsupervised learning the program does not have access to labels. Instead, the program must learn solely by analyzing the data (Sutton, 2018, Chapter 1.1). A natural use case for unsupervised learning is cluster analysis, using similarity measures such as Euclidean distance to partition the data. Some popular algorithms are k-Means clustering and Self-organizing maps.

In reinforcement learning, the program also lacks access to labels. However, it can get feedback in the form of a reward signal. A textbook RL example is learning the game

(14)

of chess. The program does not have access to the most optimal move to play as a label retrospectively, as the programmer does not have access to this information. However, the program can leverage a reward signal of the end of the game (win/loss), which weakly correlates to the quality of play. The challenge in this context is to uncover which moves were good, and which were bad. This is known as the credit assignment problem, and is a key challenge in RL.

2.1.1 Reinforcement learning environments

In reinforcement learning, the environment is the scene in which the decision-maker (agent) operates. The environment of the well-known program AlphaGo (Silver et al., 2017) is the board game Go. For a robotic control agent the environment is the scene were the robot operates, simulated or in the real world. The environment can be any setting where the agent is able to interact, and typically some type of reward signal is present.

For the pragmatic, efficient development of RL algorithms, some popular virtual environments have been established as good benchmarks that test the agent in important, challenging tasks. If an agent scores well in many tasks within a benchmark, the thought is that the agent has developed competence that can be applied to other tasks, or even used in the real world. Such a benchmark is ALE (Atari Learning environment) (Belle- mare et al., 2013). It is a challenging environment with diverse tasks represented by Atari 2600 games.

Figure 2.1: Examples of games in the Atari Learning Environment. From left to right: Breakout, KungFuMaster, MsPacMan and Pong.

2.1.2 Markov Decision Processes

Reinforcement learning is often formalized using the Markov Decision Process (MDP). It provides a mathematical framework for modeling decision making in discrete time. An MDP can be described by a tuple (S, A, T, R), where

(15)

• S is the set of states; the state space

• A is the set of actions; the action space

• T(s⁰|s, a) is the transition probability that taking action a in state s (time t), will lead to state s’ (time t+1).

• R(s⁰, s) is the immediate reward received by transitioning from state s to s’.

At each time step the agent receives anobservationandrewardfrom the environment, and picks an action based on its policy.

Theobservationcan either describe the state completely or be any arbitrary clue about the state. In the first case, the MDP is fully observable, and T(s⁰|s, a), R(s⁰, s) can be calculated if T and R are known. In the latter case, the MDP is only partially observable, and T(s⁰|s, a), R(s⁰, s) can not be solved directly as the state is not known. An example of a fully observable MDP is chess, assuming the observation is the layout of the board and position of pieces. On the other hand, Atari Games with visual (pixel) input is only partially observable as the agent does not have access to the underlying RAM, which is the state.

A fundamental assumption for MDPs is that the state transitions is conditioned on just the state and action; the transition must be independent of the history of previous states and actions. This assumption is called the Markov property.

The policyπ(a|s, θ) is the agent strategy. Given a state s and the current parametersθ, the policy will pick an action. In value-based methods, a function mapping state-action pairs to the expected cumulative reward Q(s, a) is learned, and the policy can simply query the Q-function for the optimal move.

The goal of the agent is to maximize the cumulative reward PT

t=0rt, where T is the length of the episode. A popular extension to this formula is the notion of discounted rewards by the discount rate γ: PT

t=0γ^tr_t, where 0< γ ≤1. Ifγ <1, future rewards are discounted, giving incentive for the agent to prioritize rewards near in time and finishing the episode fast, which can be advantageous in stochastic environments or when π, T or R are estimates.

2.1.3 Policy optimization frameworks

Section 2.1.2 described MDPs, and that a policy is a strategy of what actions to pick in the environment. It is the agents goal is to maximize the reward PT

t=0r_t, and since the goal of reinforcement learning is to generate the best agents, it is naturally of interest to explore how to formulate and optimize policies. There are three distinct frameworks for policy optimization.

(16)

Value-based methods attempt to learn the optimal or approximate state-action pair func- tionQ(S, A). For every state S and action A,Q(S, A) estimates the expected cumulative reward from that state and action, following the optimal policy π∗. Note that the Q- function can also be implemented as Q(S), where similarly, every state is valued, but independent of the next action. To balance exploration and exploitation, the policy might not want to pick the action with the highest Q-value every time. A popular policy is −greedy, which has a chance of choosing a random action to explore, and a (1-) chance of choosing the best-valued action (exploitation). For traditional RL problems with finite, limited state and action spaces, the Q-function can be a table with an entry for every state-action pair. It can also be an arbitrary parameterized function such as an artificial neural network (ANN).

Policy gradient methods does not attempt to learn the state-action pair functionQ(S, A).

Instead, policy gradient methods search directly in policy space. The goal is to directly optimize a parameterized policy. Such a policy can be written asπ(a|s, θ), whereθ is the parameter vector. In deep reinforcement learning,θ is implemented as an ANN with the observation as input, and the actions as output. Policy gradient methods optimizes the policy θ by gradient descent, which can be written as θ_k+1 = θ_k +α∇_θJ(π_θ)|_θ_k, where J(θ) is the expected cumulative reward of an episode using θ, and α is the learning rate parameter. ∇_θJ(π_θ) is called the policy gradient and can be thought of as the gradient of policy performance (Achiam, 2018). An example of a policy gradient algorithm is TRPO (Schulman et al., 2015).

Actor-criticmethods are a combination of value-based methods and policy gradient methods. The architecture consists of both a state-value (Q) function and a parameterized policy. The parameterized policy is the actor part of the architecture, while the state- value function is the critic, helping to guide the policy in the right direction. Mathemati- cally, the Q-function can be implemented in the J(θ) term (expected cumulative reward) from policy gradients: J(θ) = E[PT

t=0π(a_t|s_t, θ)Q(s_t, a_t)]. In words, the Q-function is used to value the policy. Note that this is just one example of an actor-critic method.

Some examples of good actor-critic algorithms are DDPG (Lillicrap et al., 2016) and SAC Haarnoja et al. (2018).

2.1.4 Model-based and model-free RL

Section 2.1.2 described an MDP as a 4-tuple, one part being the state transition function T(s⁰|s, a). The strategy concerning T is the differentiator between model-based and model-free RL. Model-based methods attempt to learn this function, while model-free methods do not.

(17)

Model-based methods

The model in model-based is the approximation of T(s⁰|s, a). The model can be used to simulate the environment by inputting state-action pairs, for which it will output predicted states; simulated experience. Simulated experience can be used to improve the policy in the same way as real experience (state transitions from the environment).

Leveraging simulated experience can be thought of planning, and (Sutton, 2018, Chap- ter 8.1) defines the term as: ”Any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment”.

If the model is perfect, the model can be used to simulate entire episodes without interacting with the environment at all, and planning can find the optimal policy given enough computational resources. However, in complex environments such as Atari it can not be assumed that the model is perfect, much less that a model is available ahead of time. Typically, interaction with the environment is used to improve the model, as new observations are new information. Dyna-Q (Sutton, 2018, Chapter 8.2) is an abstract architecture of a basic model-based RL algorithm, see figure 2.2.

Figure 2.2: The relationship of the policy, experience and model in the Dyna-Q framework. Figure taken from (Sutton, 2018, Chapter 8.2)

The Dyna-Q figure highlights that real experience from the environment can be used both to improve the model, and to improve the policy as in model-free RL. The model can be leveraged to generate simulated experience, which through planning can improve the policy in parallel.

Dyna-Q presents a method to leverage the model to produce simulated experience in parallel, passively, to improve the policy. However, it is also possible to plan online, using the model to play out future scenarios to chose the next action. This decision- time look-ahead planning is known as heuristic search (Sutton, 2018, Chapter 8.9). A well known algorithm using heuristic search is TD-Gammon (Tesau, 1995), an algorithm

(18)

learning to play backgammon at grandmaster-level, using heuristic search to simulate possible trajectories in a tree-structure to pick the action leading to the most promising states.

AlphaGo (Silver et al., 2017) is a recent well-known algorithm that beat the world cham- pion in the board game Go. AlphaGo builds on the idea of TD-Gammon, and performs heuristic search using Monte Carlo Tree Search (MCTS), combined with a deep ANN that learns action values.

Model-free methods

As stated model-free methods does not learn the transitions function T(s⁰|s, a) as the model does in model-based RL. As a result, they do not improve the policy by planning as they can not look ahead, instead model-free methods attempt to learn state-values directly (value-based methods), or optimize the policy directly (policy gradient methods).

2.1.5 On-policy and off-policy methods

Two distinct paths of RL are on-policy or off-policy methods. In on-policy methods, the current policy is estimated in the update formula 2.1 every time step, to update the Q-function. In off-policy methods however, updates to the Q-function is done without assuming the policy is followed, instead following a greedy policy maxQ(s,a) as shown in equation 2.2.

SARSA is the textbook example of on-policy algorithm, with the following update- formula for state-action pairs:

Q(S_t, A_t)←Q(S_t, A_t) +α[R_t+1+γQ(S_t+1, A_t+1)−Q(S_t, A_t)] (2.1) The update is done after every state transition St → St+1. The state-action pair is estimated by the sum R_t+1+γQ(S_t+1, A_t+1); the immediate reward and the next state- action value. The key element to note is thatQ(S_t, A_t) is updated dependant on anA_t+1 chosen by the policy. In other words, the Q-function evaluates itself.

On the other hand, Q-learning is the text-book example of an off-policy algorithm, with the update-formula for state-action pairs:

Q(St, At)←Q(St, At) +α[Rt+1+γmaxaQ(St+1, a)−Q(St, At)] (2.2) The formula is similar to 2.1, with the only difference being the max_aQ(S_t+1, a) term.

The effect is that the Q(S_t, A_t) is only dependant on on the best (greedy) action A_t+1,

(19)

and not on a potentially random action as in SARSA, if the policy is −greedy. In practice, this difference causes SARSA to be more cautious in stochastic environments, while Q-Learning will prefer finding more optimal, risky solutions.

2.1.6 Deep reinforcement learning

DQN, Deep Q-Networks, (Mnih et al., 2013) combined Q-Learning from traditional RL with deep learning, and the approach outperformed all previous approaches on six of the seven games in the ALE benchmark. The paper can be considered to have launched the field of deep reinforcement learning.

Applying ideas from traditional reinforcement learning to ALE is hard. For one, the state space is too vast to store in a table like in tabular RL. DQN represented the state-action pair value function Q(S, A) with an ANN. Specifically, the state is the input to the ANN and the output has the dimensionality of the action space, each output representing the estimated Q-value of that action. This is illustrated in figure 2.3.

ANN

S

Q(S,A1) Q(S,A2)

Q(S,An) ...

Figure 2.3: Deep Q-learning uses an artificial neural network to represent the Q-function.

During training of this Q-network, the loss function is inspired by the iterative Bellman formula 2.2, as used in Q-learning. Simplified, the loss function is essentially the square error of the Q-estimate before and after the time step:

L(θ_i) = (r+γmax_a⁰Q(s⁰, a⁰, θ_i−1)−Q(s, a, θ_i))² Where θ is the Q-network parameters.

DQN also implements a series of tricks such as storing transitions in an experience replay buffer, to randomly sample from instead of using consecutive samples which are highly correlated. Not all deep RL is based on Q-learning and Q-functions, policy gradient methods are a popular approach that search directly in policy state, as mentioned in section 2.1.3.

(20)

2.2 Recent model based reinforcement learning methods

2.2.1 SimPLe

SimPLe (Kaiser et al., 2019) is a model-based RL method that achieved state-of-the art sample efficiency on the Atari benchmark. It was the first model-based RL method that was able to compete with model-free methods on the benchmark.

The method learns a model designed to replicate the environment throughout the training process that is leveraged for simulated experience. The method is quite similar to Dyna- Q (section 2.1.4), with the difference being that experience does not affect the policy through ”direct RL”. The policy (initialized to random) generates observation from interacting with the environment, the observations train the world model, and the world model generates trajectories to improve the policy. This loop is illustrated in figure 2.4.

Figure 2.4: The simPLe loop showing the relationships of the policy, observations and world model.

Figure taken from Kaiser et al. (2019)

.

The policy is updated with simulated trajectories similar to how DQN (Mnih et al., 2013) uses experience replay. Stored transition data from the environment is sampled, however alternative trajectories are generated from the real trajectories by the model. This allows the policy to explore different decision and see the outcomes predicted by the model; in theory increasing sample efficiency if the model is accurate. SimPLe uses proximal policy optimization (PPO) (Schulman et al., 2017) as both the roll-out policy (the policy used to explore the model), and as the main policy interacting with the environment.

The world model is trained to simulate the real environment, with the same action space, rewards and observations. The architecture used to learn the model is a novel stochastic video prediction model, however it is somewhat similar to a variational autoencoder

(21)

(Kingma and Welling, 2014). The input is the four previous frames and an action, and the output is the predicted next frame.

The authors claimed state-of-the art sample efficiency in the preprint, but later research (Van Hasselt et al., 2019) showed that Rainbow (Hessel et al., 2017) could be tuned to be equally sample-efficient in low data Atari regimes. Still, it remains the first successful model-based approach to be competitive on the benchmark. In terms of computational efficiency however, SimPLe is not as impressive. The paper presented results after 100k time steps (interactions with the environment). During these 100k time steps, SimPLe interacted 15.2M times with the world model, and considering that the world model is trained to mimic the environment fully (not a latent model), the computational time is considerable compared to model-free methods.

Although SimPLe remains as a proof-of-concept for model-based methods on Atari, Dreamer (section 2.2.2) uses a latent world model which is considerably more computationally efficient, and MuZero (Schrittwieser et al., 2019) combines learned models with tree search.

2.2.2 Dreamer

Dreamer (Hafner et al., 2019) is a recent model-based deep reinforcement learning algorithm. The authors claim it beats previous approaches in visual control in terms of sample-efficiency, computation time and final performance. Especially the computational efficiency is impressive, as planning with model-based methods often is expensive.

Model-based methods predict rewards in a finite time horizon (roll-out length/depth), and uses this reward prediction to guide the policy. A weakness of this approach is that it may lead to shortsighted behaviour, as rewards beyond the imagination horizon are not considered. Mitigating this problem is arguably the focus of Dreamer. They use a novel actor-critic algorithm to predict both action and state values, meant to give insight beyond the imagination horizon.

Throughout the agents lifetime, it performs three operations interleaved or in parallel.

These are illustrated in figure 2.5.

• Learning compact latent state representation of observations and actions, and predict rewards.

• Predict state values and actions by Bellman consistency and analytical gradient.

• Compute model state and act in the environment to collect new experiences

(22)

Figure 2.5: a) Learn the latent dynamics b) Learn behaviour latently c) Act in the environment. Figure taken from Hafner et al. (2019)

The latent dynamics model consists of three components. A representational model L(s_t|st−1, at−1, o_t) that encodes observations into states, a reward model that predicts the states rewardsR(r_t|s_t), and a transition modelT(s_t|st−1, at−1) that predicts the next state without seeing an observation. L is implemented as a recurrent state space model (RSSM) (Hafner et al., 2018), and R as a dense neural network.

The representational modelLwas implemented in several ways. The authors got best results with a pixel reconstruction technique based on a variational autoencoder. However, this approach has some issues. Okada and Taniguchi (2020) points out that Dreamer can suffer from object vanishing, where important objects are not represented in the latent space of the autoencoder. The authors of Dreamer themselves highlight representation learning as a future research direction, especially in visually complex environments such as Atari.

The experiments were conducted on multiple tasks in two environments: visual control tasks in DeepMind Control Suite Tassa et al. (2018), and on Atari. On the control tasks, Dreamer is compared to D4PG (Barth-Maron et al., 2018), a strong model-free method, and achieves a higher score using less time steps. The authors contribute this data-efficiency to the learned world model, that is able to generalize well with little data.

On Atari, Dreamer is compared to Rainbow (Hessel et al., 2017) and DQN (Mnih et al., 2013). While Dreamer outperforms the model-free methods on some games, it does not on others, indicating that further research to improve stability is necessary.

2.2.3 Direct Future Prediction

Direct Future Prediction (DFP) (Dosovitskiy and Koltun, 2016) is a model-based RL method, that won a VizDoom AI competition (Dosovitskiy and Koltun, 2016) in 2016.

The method predicts dense, high-level features of the environment multiple timesteps into the future, which it uses to optimize for a parameterized goal.

Making good models that can accurately predict future states of the environment is

(23)

difficult. Some model-based methods, such as Dreamer, only attempts to predict latent state representations, instead of comprehensive state predictions or true observations.

The intuition behind limiting the scope of prediction is to lessen the workload on the prediction model, so that it can more accurately predict what is important. Direct Future Prediction takes this intuition to the extreme. Instead of predicting an observation or a latent state, it predicts a few high-level features: ”measurements”.

The measurements are a handpicked subset of the observations, reward and or environmental state. In the original DFP paper (Dosovitskiy and Koltun, 2016), DFP is applied on the VizDoom environment, and the measurements are ammunition, kills, health and health packs. These measurements provide high-level feedback from the environment, while being concise in dimensionality. Contrary to other model-based methods, the model output (the measurement vector) can not be used as input for further predictions. This limits the method to make predictions with a set depth into the future.

The architecture is relatively straight forward. It is illustrated in figure 2.6, taken from the original DFP paper (Dosovitskiy and Koltun, 2016). Image-observations are passed through a perception module S. Similarly, the measurements and goal is passed through the measurement module M and the goal module G, respectively. The output of the modules are concatenated into a joint input representation j.

Figure 2.6: The DFP architecture. Figure from Dosovitskiy and Koltun (2016)

The training and inference phase

It is important to know the difference between the training phase and the inference phase to understand later experiments. In DFPs training phase, an -greedy policy (section 2.1.3) is used. The -greedy policy sometimes chooses actions at random, to facilitate better exploration and collect more diverse data, so DFP can improve. In practice, this

(24)

has two important effects. Firstly, the training phase will sometimes score better than the model could otherwise, if DFP have not learned the best solution yet: The-greedy policy will find a good solution by chance. Secondly, if DFP has learned a perfect solution, then it will score worse in training because the randomness introduced by the -greedy policy will be limiting.

In inference, DFP simply always chooses the action that it predicts is best, according to the goal vector. It does not use the -greedy policy, or alternatively, the parameter is zero. For example, the measurement vector might be [apple, banana]. For the action right, the predicted future measurement vector is [0.3, 0.4]. For the action left, the predicted future measurement vector is [0.7, 0.2]. If the goal vector is [0.9, 0.1], DFP will choose actionleft as 0.7∗0.9 + 0.2∗0.1>0.3∗0.9 + 0.4∗0.1.

The sample goal technique

DFP requires a goal vector, which is usually chosen by the programmer and static through the learning and inference phases. However, the authors of DFP also present another approach to the goal vector: goal-agnostic training. In goal-agnostic training (referred to as the sample-goal technique in this paper), the numbers in the goal vector is uniformly sampled at random from the range [0, 1], every episode. Alternatively, the range [-1, 1]

can be used, if it is desirable for the agent to learn to minimize certain measurements.

The purpose of the sample goal technique is for the agent to learn to use different goals, so that different goals can be used in inference, and make the agent able to generalize to new tasks.

2.3 Intrinsic motivation, and estimating uncertainty with an ensemble

This section briefly describes intrinsic motivation, and then explains a recent intrinsic motivation method that estimates uncertainty with an ensemble. The method is later combined with DFP.

2.3.1 Intrinsic motivation

The idea of intrinsic motivation (IM) is to develop an agents behaviour not by extrinsic (environmental) reward, but by intrinsic curiosity, or satisfaction. IM is inspired by how human babies curiously explores the world: not to receive positive feedback but from an inner motivation.

The field of intrinsic motivation is vast and contains many different methods. Oudeyer and Kaplan (2009) propose a classification of IM with two main classes: Knowledge-based

(25)

and competence-based methods. In knowledge-based methods, the IM learning signal is based on the error of a predictive model. If the model error is high, the learning signal is also high, to encourage the agent to spend more time in that uncertain area of the state space. In competence-based methods, the learning signal is based on the rate of improvement of the model. The agent is encouraged to explore in areas where it is able to improve.

2.3.2 Estimating uncertainty with an ensemble

Knowledge-based methods have a critical weakness, in that they do not respond well to noise. If an environment has a source of random noise, the agent will be encouraged to explore this source of randomness as the model error is high. However, the effort is wasted because there is nothing to learn. The problem is illustrated by the Noisy-TV problem (Burda et al., 2018): If an environment contains a TV broadcasting random noise, the agent will get a high intrinsic reward watching the TV. It would fail to make meaningful progress, be forever attracted to the noisy-TV and become a ”couch potato”. In environments with a noisy TV or other sources of unevenly distributed noise, competence-based methods would be favourable. However, they have been hard to implement in practice until recently.

A recent technique in intrinsic motivation is to estimate ”disagreement” with an ensemble of predictive models. As the models learn to predict the environment better, they will converge to similar solution. As such, the rate of model improvement can be estimated by computing the disagreement in an ensemble. This approach can overcome the noisy-TV problem, as the models in the ensemble will converge towards uniform predictions, and low disagreement. Several recent papers have implemented this idea.

Model-Based Active Exploration (MAX) (Shyam et al., 2018) implements the method and test it on a hard-exploration ”chain” environment. They show that MAX significantly beats other exploration method. They also put a stochastic trap (noisy-TV) in the environment, and showed that MAX was able to overcome the trap, although it slowed it down. MAX consists of a training phase, where it does not use any extrinsic reward, and is just focused on maximizing the intrinsic reward from the ensemble disagreement.

In inference, it can adjust to use the extrinsic reward instead.

Pathak et al. (2019) implements a similar method, and tests it on stochastic-Atari, Mu- joco, Unity and on a physical robot.

Sekar et al. (2020) implements the same idea in Plan2Explore. Plan2Explore builds on the PlaNet model (Hafner et al., 2018), and Dreamer (section 2.2.2). The PlaNet model learns a latent state representation, that Plan2Explore learns to predict for future states.

The variance between the members of the ensemble, Latent Disagreement , is calculated

(26)

and used as an intrinsic motivation learning signal. Similarly to MAX and Pathak et al.

(2019), they use a purely exploratory training phase, in which environmental reward is ignored. To adapt to different tasks in inference, Plan2Explore implements zero-shot and few-shot task adaption. The zero-shot task adaption calculates the task-specific reward using the reward function and the data in the replay buffer. The few-shot method plays additional episodes to calibrate further, to enrich the replay buffer with task-specific episodes. The method was tested on continous control tasks in the DeepMind control suite (Tassa et al., 2018), where it outperformed prior self-supervised exploration methods.

(27)

Chapter 3

Methodology

This chapter describes the methodology in the thesis. The chapter covers how to test DFP: What environments are used, and how to visualize the policy. Ensemble-DFP is also introduced: A new method developed as part of this thesis, combining DFP and intrinsic motivation.

3.1 Testing Direct Future Prediction

This section describes the environments that Direct Future Prediction (DFP) (Doso- vitskiy and Koltun, 2016) is tested on, why they are chosen and issues that have to be considered when applying DFP. The section also describes policy visualization techniques, for each of the environments.

3.1.1 Environments

One research question is to test DFP in other environments, and to study what can be learned. The environments chosen for this thesis are MountainCar and MiniGrid. These are simple environments that are less complex and computationally demanding than VizDoom. Obando-Ceron and Castro (2020) argues that such small-scale environments can yield valuable scientific insights, if the results are analyzed properly.

(28)

Figure 3.1: An example map layout in the MiniGrid environment. The red arrow is the agent, which is currently facing right. The green tile is the goal. The highlighted squares (lighter tone) is visualizing the current 7x7 observable area. In this environment, reaching the goal gives a reward and terminates the episode.

MiniGrid (Chevalier-Boisvert et al., 2018) is a lightweight, discrete grid world environment, where the agent moves one square or turns, per time step. Figure 3.1 shows an example map layout. It is an existing, popular environment, see (Chevalier-Boisvert et al., 2018) for a list of published citations. The objective is to reach the goal or collect coins, which may require overcoming different obstacles and solving puzzles. However, in this thesis, the map layouts are kept simple in order to make them solvable within limited time. The environment was chosen because it efficiently tests planning-ability and exploration. The goal can be far away and out of sight for the agent. Thus, exploration and the ability to plan many timesteps ahead is required. VizDoom, which DFP was tested on originally (Dosovitskiy and Koltun, 2016), is on the contrary arguably a reactive environment, that can be solved by reacting to what is currently in view. For example, if several monsters are on screen in the VizDoom environment, the optimal behaviour to learn is to shoot the closest monster, and immediately continue with the next one. This requires the agent to master the mechanics of the game, but does not involve planning far into the future. By keeping the MiniGrid environment simple, planing can be tested in isolation and with better computational efficiency. By designing a map with multiple goal-tiles, it is also easy to test multi-task performance.

Several observation formats are available, but a partially observable vector format was chosen. Specifically, the observation contains the contents of the 7x7 tile group in front of the agent. This format was chosen because it is less computationally demanding than

(29)

pixel observations, and because it demands exploration by not always having the goal in view. The action space is limited to only contain 3 actions: Move forward, turn left and turn right.

Three maps were designed to test the methods in different levels of exploration and multi- task performance, see figure 3.2. The small map (3.2a) is a map consisting of a 3x3 grid, where the upper left corner is a set starting position, and a green goal is always located in the bottom right. This is a simple map used for implementation verification. The medium map (3.2b) is a map with two different goals, so multi-task methods can be tested. The map is small enough so both goals will be found by random exploration. The map also has a 3x3 starting area, in which the starting position will be chosen at random. Finally, the large map (3.2c) is a harder version of the medium map, where efficient exploration is required to find the blue goal.

(a) The small map (b) The medium map

(c) The large map

Figure 3.2: The three maps designed in the MiniGrid environment. The gray arrow is the agent, and the green and blue tiles are the green and blue goals, respectively. The highlighted areas inside the white border is the starting area. The orange tiles are lava, which will terminate the episode if entered.

In order to apply DFP on the MiniGrid environment, a set of measurements have to be chosen. There are several possible measurements, for example distance to goal, distance

(30)

to lava, or count methods to encourage exploration. To keep it simple, and keep the environments hard, the measurement vector was defined as m= [g, b], where g and b are the accumulated green and blue reward, respectively.

MountainCar is a continuous environment where the agent controls a car, that starts in a valley between two hills, see figure 3.3. The goal is to reach the top of the right hill, but the car does not have enough power to get there immediately. The agent first has to drive back and forth, to build up speed, before reaching the top is possible. The reward is r = −1 for each time step, and r = 0 when the agent has reached the goal, and the episode is terminated. The time step limit is 200, so if the goal is not reached, the cumulative episode reward is -200. If the goal is reached, the cumulative episode reward is the number of time steps it took to reach the goal, multiplied by -1. There is not a static optimal solution as the starting position is stochastic, but the environment is considered solved if the average reward over time is more than -110.

Figure 3.3: The MountainCar environment. The flag in the top right corner is the goal. The position is measured along the x-axis of the image, and spans the range [−1.2,0.6]. The highlighted area in green marks the range [−0.6,−0.4], which is the starting area. The starting position is uniformly sampled within this range. Note that in the starting position, the cars wheels are always on the track.

The MountainCar environment is a well known environment, and a text book example of an environment with sparse rewards. A reward is sparse when the reward signal

(31)

is sparsely distributed in time. This is particularly true in MountainCar, where the reward is constant for every time step but the last. The sparse reward and computational efficiency is why the environment was chosen to test DFP on. In DFP, the goal vector can be designed in different ways, which might help in the sparse reward setting. Similarly to MiniGrid, the MountainCar environment also tests long term planning, because the goal can only be reached many time steps in the future.

The observations in MountainCar are a two-dimensional vector o = [p, v], where p is the current position of the agent, and v is the current velocity of the agent. Position is measured along the x-axis of the map, as illustrated in figure 3.3. v is the velocity parallel to this axis, and is negative or positive depending on directionality. The DFP measurement vector was chosen to be m= [p, v, r], where r is the reward.

An issue to keep in mind with DFP is that the predictions are done several time steps in the future. Assume that DFP predicts the future measurement vector [m_t+16, m_t+32], where t is the current time step. Notice that if the reward is received at a time step t < 16, and the episode is terminated, then the reward is actually not included in the predicted measurement vector. To avoid this issue, if the episode is terminated, future measurements are generated as copies of the last measurement. In the case of MiniGrid, the episode is implemented not to terminate when the agent finds the goal, but to reset the position of the agent to the start area, and permanently increment the reward. For example, if the reward for finding a goal is r=2, every future reward r_f will now be 2 higher (r_f :=r_f + 2) This incentives the agent to find the shortest path, as it will allow the agent to find the goal multiple times in the same episode. The episode termination is not delayed in MountainCar, as it is generally not possible to reach the goal twice within the episode length limit of 200 time steps, and the reward already incentives fast solutions.

3.1.2 Policy visualization

In order to better understand why some DFP configurations fail, and others succeed, it is useful to visualize the policy (section 2.1.2). The policy is visualized by plotting the state space, together with the action chosen by the policy for each plotted state. A policy visualization method was implemented for each environment.

Policy visualization on MiniGrid

The action space in this thesis implementation of MiniGrid contains three actions: Turn left, turn right, and forward. The state space is determined by two properties: which tile the agent occupies in the grid, and the agents current direction. One way the policy can be visualized is to plot 4 versions of the map, one to each direction to agent can be

(32)

Figure 3.4: An example policy plot for MiniGrid

facing: Right, down, left and up.

An example policy plot is shown in figure 3.4. The plot fully describes the policy for each state. The gray arrows indicate that the policy is to turn to that direction, and the red arrows indicate that the policy is to move forward. Starting in the upper left corner in the right state, the following action sequence will be played: Forward, forward, turn to face down, forward, forward (and reach the goal).

Note that the state space is not fully represented, as the current measurement input and goal vector is not considered. However, in the experiments the goal vector is static within each episode, and the measurement vector makes little difference. Nevertheless, the policy might change slightly when a goal is reached because of the cumulative reward measurement.

(33)

Figure 3.5: An example policy plot for MountainCar

Policy visualization on MountainCar

The MountainCar action space contains three actions: accelerate left, accelerate right and the no-op (no change). The state space is continuous and two-dimensional: It consists of the car position and the car velocity. The policy can therefore be visualized in a plot, where the color determines the action.

Figure 3.5 shows an example policy plot on MountainCar. The plot shows the entire state space from minimum to maximal values. Following this plot, the agent will mostly accelerate left when the velocity is negative (when the car is going left), and accelerate right when the velocity is positive (when the car is going right). The policy also chooses the no-op action when the position is close to the edges.

Good policies on MountainCar builds acceleration, because high speed is required to reach the goal at the top of the right hill. Building acceleration is done by accelerating in the direction the car is moving. Therefore, the policy in figure 3.5 appears reasonably good.

3.2 Ensemble-DFP

Ensemble-DFP is a novel augmentation of DFP, developed as part of this thesis. It is designed with the intent of improving exploration. The idea is to combine the computationally efficient predictions of DFP, with an intrinsic motivation uncertainty signal, computed by an ensemble approach. Section 2.3 presents the ensemble techniques and

(34)

some recent methods that implement it.

DFP has some characteristics that makes it an interesting method to combine with the intrinsic motivation technique. A problem with existing methods is the need to calibrate the policy after training, before it can be used in inference. Plan2Explore (Sekar et al., 2020) implements a zero-shot calibration methods that can optimize the policy to a new goal without further environmental interaction, by resampling the replay buffer. Plan2Explore also implements a few-shot approach, in which the agent plays some episodes to calibrate.

The zero-shot method is faster, while the few-shot method gives better performance but cost significant compute and time. An advantage of DFP is that the goal is parameterized as an input vector that can be changed dynamically, which lets the agent adjust to new goals online. This removes the need for the zero-shot and few-shot methods.

The next sections describe the Ensemble-DFP architecture, the training and inference phases and challenges that have to be considered about scaling measurements.

3.2.1 Ensemble-DFP architecture

Figure 3.6: The original DFP architecture. Figure from Dosovitskiy and Koltun (2016)

Direct Future Prediction (Dosovitskiy and Koltun, 2016) uses a relatively simple network architecture. Figure 3.6 shows the architecture layout, which is taken from the original paper (Dosovitskiy and Koltun, 2016). The inputs consist of an image s, a measurement vectorm and a goal vectorg. They are processed by a convolutional networkS, and the fully connected networks M and G, respectively. The outputs of the input networks are merged into a joint vectorj. j is then split into an action independent stream E, and an action dependent stream A, both predicting vectors of future measurements.

In order to keep Ensemble-DFP comparable to DFP, the Ensemble-DFP architecture is

(35)

Figure 3.7: The ensemble-DFP architecture. The Expectation and Action streams have been replicated to form an ensemble of size n. The figure is based on a figure from the original DFP paper (Dosovitskiy and Koltun, 2016).

(36)

kept similar to the DFP architecture. However, some changes has to be made in order to implement ensembles. An ensemble is a set of similar networks. The simplest way to implement Ensemble-DFP would be to make n copies of the entire network, for an given ensemble size n. However, this approach is not computationally efficient. The input networks (S, M, G) would be replicated n times and thus require n times the computational budget to be trained. Moreover, this computation would be spent at better representing the inputs, not at directly increasing diversification in prediction.

The prediction is done by the networks E and A, therefore these networks are chosen to be replicated in the ensemble. Figure 3.7 shows the DFP-ensemble architecture. The architecture is similar to the original DFP architecture, however the networks after the joint input vector are replicated to form an ensemble.

Ensemble-DFP was tested on the MiniGrid and MountainCar environments. These environments have different types of observations and measurements. Therefore, the input networks must be adapted to be compatible with each environments. Specific network structures for each environment are presented in section 4.3.

3.2.2 Ensemble-DFPs training and inference phases

In DFP, the difference in the training and inference phases is minimal. The only difference is that in training, DFP is on a-greedy policy to increase exploration. In inference, DFP uses a greedy policy using its predictions and goal vector. The difference in these phases is more stark for Ensemble-DFP. During training, Ensemble-DFP is purely exploratory.

It uses the variance of the ensemble predictions to estimate the action with the highest uncertainty, and picks that action. In fact, the goal vector is not relevant in the training phase. In inference, Ensemble-DFP uses the goal vector and the same greedy policy as DFP.

The training phase

The training phase is purely exploratory. The training policy is not affected by the goal vector as the policy just picks actions according to how uncertain the ensemble is about the actions future. Note that the -greedy policy is used in the Ensemble-DFP training phase, as it is in DFP. Since the training phase is exploratory, it is possible that Ensemble-DFP would do better without -greedy entirely, or with a more rapidly decayingschedule. These questions are left for future work. Having a purely exploratory training phase is similar to other methods using the ensemble uncertainty approach (see section 2.3).

The training policy is implemented in figure 3.8. The function has the predicted future measurements as an input, and calculates the action with the highest uncertainty. Un-

(37)

1 import numpy as np

2 def training_policy(p_np, action_space, v_l):

3 ''' The training policy: Calculate the most uncertain action

4

5 Parameters:

6 p_np : Numpy vector with predicted future measurements

7 with dimensionality (E, F, A). E is the ensemble

8 size, F is the number of future coefficents and

9 A is the action space

10 action_space : The action space (the number of actions)

11 v_l : A parameter to set the threshold for

12 variance suppression.

13

14 Returns:

15 Index of the most uncertain action

16

17 '''

18

19 v = np.zeros(action_space) #variance per action

20 for action in range(action_space):

21

22 norm_scale = np.max(np.abs(p_np[:,:,action]))

23 if norm_scale < v_l:

24 norm_scale = 1 # Suppress low variances

25

26 v[action] = np.var( p_np[:,:,action]/norm_scale )

27

28 # Return the action with the highest variance

29 return np.argmax(v)

Figure 3.8: Python code implementing the training policy. The code is simplified by having removed batching-support.

(38)

certainty is estimated by the variance of the predicted future measurement vectors within the ensemble.

It is important to note that variance is not scale independent: Large numbers with small differences (in percentage) have higher variance than low numbers with big differences, in percentage. If this is not accounted for, the training policy would favor picking actions leading to states with high measurement values, instead of actions it is uncertain of.

There are many ways to handle the problem. The measurements could be adjusted to have a similar range of values, or they could be normalized by a rolling mean or other methods. In this implementation, a simple normalization technique was chosen. For each action, the measurements are divided by the highest measurement, as can be seen in figure 3.8 (lines 22-26).

However, this normalization technique does not filter noise. When the predicted values are close to zero, the predicted measurement values can vary by several orders of mag- nitude, empirically. To avoid this issue, a ”lowest value” parameter v_l was implemented.

If the maximum predicted measurement for a given action is below this value, then normalization is not performed, and the variance will be close to zero. This is implemented in figure 3.8 lines 23-24. Through empirical testing, the v_l parameter was chosen to be 0.05 for all experiments. Full parameter tables are found in section 4.3.

The inference phase

In inference, Ensemble-DFP tries to achieve the best measurements according to the goal vector, instead of exploring. The -greedy policy is not used either. Ensemble-DFPs inference phase is equal to DFPs inference phase. However, since Ensemble-DFP has the ensemble architecture, a method is needed to produce a single predictionp_s out of the n ensemble predictions. This is done by calculating the mean of the ensemble predictions, and using that prediction p_s as the input to the inference policy. In DFP, the inference policy is simply to pick the action that has the highest predicted future measurement vector, in relation to the goal vector. Equation 3.1 illustrates the process of calculating p_s.

Letp1,p2, ... ,pnbe future measurement predictions from thendifferent ensembles. The combined predictionp_s is:

p_s= p₁ +p₂+...+p_n

n (3.1)

(39)

Chapter 4

Implementation

This chapter describes technicalities relevant to the implementation, and lists the DFP parameters, and network architecture tables.

4.1 Technical setup

The thesis is implemented in Python 3, with the PyTorch machine learning framework.

The environments are openAI environments. Environment wrappers were used in order to generate the DFP measurements from observations, reward and environment state.

4.2 Customizing the MiniGrid environment

In Chapter 3, MiniGrid (Chevalier-Boisvert et al., 2018) and MountainCar were presented as the chosen environments for this thesis. Both are existing environments. The Moun- tainCar environment was not changed, as it is compatible with DFP and the experiments.

However, the MiniGrid environment was modified in several different ways. First, the action space was limited to 3 actions: Move forward, turn left and turn right. The other actions such as pick-up, drop and toggle was removed. MiniGrid also only has one type of goal, rendered as a green tile. Since a research question is to test multi-task problems, an additional goal was implemented, which is rendered as a blue tile and functions the same as the green goal.

4.3 DFP and Ensemble-DFP parameters

DFP has several parameters related to prediction, optimization and the network architecture. As the authors of DFP (Dosovitskiy and Koltun, 2016) already performed some hyper parameter testing, and got good enough results to win a track of the Visual Doom

(40)

AI competition in 2016, few parameters were changed. Ensemble-DFP uses the same parameters as DFP, unless otherwise specified.

Parameter Value used in this thesis Value in DFP (Dosovitskiy and Koltun, 2016)

Min horizon 1 4

Future steps [1, 2, 4, 8, 16, 32] [1, 2, 4, 8, 16, 32]

Temporal coefficients [0, 0, 0, 0.5, 0.5, 1.0] [0, 0, 0, 0.5, 0.5, 1.0]

Goal space pos-neg Varies per experiment

Table 4.1: The prediction parameters

Table 4.1 shows the parameters related to prediction. Future steps determines both how many steps each forward call predicts, and how far into the future those steps are. The parameter value predicts 6 future timesteps, from 1 time step into the future to 32 time steps ahead. The temporal coefficients weigh the predictions at these timesteps, before they are used to determine the next action. The reason that some temporal coefficients are zero, it that the authors of DFP (Dosovitskiy and Koltun, 2016) found that auxiliary prediction can be useful, as they increase increase the dimensionality of the supervisory signal that is used during training. Min horizon is a parameters that sets the minimum remaining episode length that is required for a transition to be chosen from the replay buffer. Since being close to an episode end can indicate that the agent is about to move into a trap or solve the environment in this thesis’ environments, this value is set to 1.

This is possible of the technique used in this thesis that replicates environmental feedback for out-of-episode timesteps, described in section 3.1.1. The goal space can be either pos- neg or pos, determining if the goal vector can contain positive and negative values, or only positive values, respectively. This is relevant when the sample goal technique (2.2.3) is used.

Batch size 64 64

Learning rate 0.0001 0.0001

Scheduler step 0.5 0.3125

Scheduler decay 0.2 0.3

Table 4.2: The optimization parameters

Table 4.2 shows the parameters related to optimization. Some empirical experimentation was performed to optimize the learning rate schedule, but results were not significantly improved. The scheduler step determines the fraction of total training steps that needs to be run before the learning rate is multiplied with Scheduler decay. The scheduler step value of 0.5 means that the learning rate will decay by 0.2 once, when half of the training steps have been executed.

(41)

Replay capacity 20000 20000

Epsilon start 1.0 1.0

Epsilon end 0.15 0.15

Ensemble size 5 -

vl 0.05 -

Table 4.3: Other parameters

Table 4.3 shows other DFP parameters. Replay capacity is the size of the replay buffer.

Epsilon in -greedy policies scales linearly from Epsilon start to Epsilon end. The En- semble size is the number of members in the ensemble. The ensemble size was chosen to be 5, which is the same value that Plan2Explore used (Sekar et al., 2020). vl is the measurement scaling parameter that was discussed in section 3.2.2.

Module Layer dimensions Perception

1029 128 128

Measurement m 128 128

Goal

m*6 128 128

Expectation 384 512 m*6

Action

384 512 m*18

Table 4.4: Network architecture for MiniGrid networks

(42)

Module Layer dimensions Perception -

Measurement 3 128 128

Goal

18 128 128

Expectation 256 512 18

Action

256 512 54

Table 4.5: Network architecture for MountainCar networks

Tables 4.4-4.5 shows the layer dimensions for the DFP (and ensemble-DFP) networks used in this thesis. Each module has three layers, which is why the table has three numbers per module. Otherwise, the network is equal to the network used in the original DFP paper (Dosovitskiy and Koltun, 2016). See figure 2.6 for an illustration of the module layout. Note that Dosovitskiy and Koltun (2016) used a convolution neural network as the Perception module. MiniGrid observation are one-hot encoded vectors, so a fully connected network similar to the other modules are used instead. The MountainCar network does not have a perception module, as all the observations are used as measurements. As a different number of measurements m are used in the MiniGrid experiments, the layer sizes varies slightly between each configuration.

(43)

Chapter 5

Experiments and Results

5.1 Verifying the implementation

To verify that the implementation of Direct Future Prediction is correct, DFP was tested on a small 3x3 map (Figure 5.1) in the MiniGrid environment (3.1.1). The start position, direction and the position of the goal is constant for each episode: The map is completely deterministic. Remember that when the agent (the red arrow) reaches the green square, the agents position and direction is reset to the start configuration, but the episode is not terminated. This map was used with a time step limit of 100. There is only one measurement here: m = [g], where g is the number of times the agent has reached the green goal. The experiment was done with two different goal vectors, and the sample-goal technique (2.2.3). An overview of the configuration are presented in Table 5.1.

Solving Long Term Planning Problems with Direct Future Prediction

Solving Long Term Planning Problems with Direct Future Prediction

Sindre Thoresen Kai Olav Ellefsen

Jim Torresen

Thesis submitted for the degree of

Master in Informatikk: Robotikk og intelligente systemer 60 credits

Institutt for informatikk

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Solving Long Term Planning Problems with Direct Future

Prediction

Sindre Thoresen Kai Olav Ellefsen

Jim Torresen

Abstract

Acknowledgements

Contents

Chapter 1

Introduction

1.1 Motivation

1.2 Research questions

1.3 Thesis structure

Chapter 2

Background

2.1 Reinforcement learning

2.2 Recent model based reinforcement learning methods

2.3 Intrinsic motivation, and estimating uncertainty with an ensemble

Chapter 3

Methodology

3.1 Testing Direct Future Prediction

3.2 Ensemble-DFP

Chapter 4

Implementation

4.1 Technical setup

4.2 Customizing the MiniGrid environment

4.3 DFP and Ensemble-DFP parameters

Chapter 5

Experiments and Results

5.1 Verifying the implementation