
Analyzing Approaches to the Problem of Avoiding Side Effects in Autonomous Agents


Academic year: 2022


Serwa Waisi

Thesis submitted for the degree of

Master in Informatics: Programming and System Architecture

60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

© 2020 Serwa Waisi

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

Abstract

Reinforcement learning is a field within machine learning which trains agents to solve problems by trial and error. A reinforcement learning agent interacts with the environment through actions. Depending on the outcome of these actions, the agent is either rewarded or penalized. The overall goal of the autonomous agent is to maximize its objective function, which means obtaining the maximum amount of reward. While the agent follows its objective function, its actions can cause unexpected harm or unwanted changes to the environment and thus compromise safety. These unwanted events are defined as negative side effects.

The focus of this work is to analyze approaches to avoiding negative side effects and to find solutions to this problem without integrating ad hoc constraints into the agent’s objective function. Such constraints are not integrated because we want the agent to have general knowledge of how to solve side effect problems, as it is impossible to foresee all actions that could cause unexpected harm or unwanted changes in a real-world environment.

The experimental part compares two environments implemented in different settings, one in a safety setting and one in a standard setting. The agent in the safety setting is not directly aware of the penalization for causing side effects, as it is not a part of its objective. In contrast, the agent in the standard setting is directly penalized and knows the consequence of causing harm. The experimental part shows that the agent in the standard setting performs well and is close to learning the optimal behaviour. Meanwhile, the agent in the safety setting does not behave as desired and fails to solve the problem of side effects.


Acknowledgement

I would first like to thank my thesis advisor, Postdoctoral Fellow Fabio Massimo Zennaro, of the Department of Informatics (IFI) at UiO. His advice and support have made this thesis possible. Zennaro has been a reliable source of guidance throughout the year and has given me constructive feedback.

I would also like to thank my family and friends for their support and for encouraging me while writing this thesis. A special thanks to Sairan Waisi, Kavita Bhamra and Lucas Paruch for proof-reading and giving me valuable comments.

Author Serwa Waisi


Contents

1 Introduction
1.1 Problem statement
1.2 Motivation and research goal
1.3 Structure

2 Background
2.1 Reinforcement learning
2.1.1 Sub-elements of Reinforcement Learning
2.1.2 Markov Decision Process
2.1.3 Reinforcement learning methods
2.1.4 Dynamic Programming
2.1.5 Monte Carlo method
2.1.6 Temporal Differences
2.2 AI Safety
2.2.1 The EU Commission and AI Safety
2.2.2 Compromising AI Safety
2.2.3 Negative side effects
2.3 Related work

3 Methodology
3.0.1 Algorithms
3.1 Reinforcement learning in a standard setting: The Cliff-walking environment
3.2 Reinforcement learning in a safety setting: side effects Sokoban
3.2.1 The side effects Sokoban
3.2.2 The environment
3.3 Frameworks

4 Implementation
4.1 OpenAI Gym
4.2 The Cliff-walking environment
4.3 The side effects Sokoban environment
4.3.1 The OpenAI Gym wrapper
4.4 The Q-learning and the Sarsa function
4.4.1 Modification of the Sarsa function

5 Results
5.1 The Cliff-walking agent trained with Sarsa and Q-learning
5.2 The side effects Sokoban agent trained with Sarsa and Q-learning
5.3 The side effects Sokoban environment with and without the wall penalty
5.4 Comparing the two environments
5.4.1 The return of rewards with Q-learning and Sarsa
5.4.2 The episode lengths with Q-learning and Sarsa
5.4.3 The Cliff-walking task and the side effects Sokoban with wall penalty

6 Discussion
6.1 Comparing the results from the Q-learning and Sarsa methods
6.1.1 Sarsa and Q-learning: The Cliff-walking problem
6.1.2 Sarsa and Q-learning: The side effects Sokoban environment
6.2 Comparing the side effects Sokoban environment
6.3 Comparing the two environments
6.3.1 Without the wall penalty
6.3.2 With the wall penalty included in training the side effects Sokoban agent
6.4 Comparison of the results with the literature review

7 Conclusion
7.1 Limitations of the study
7.2 Future work


List of Figures

2.1 An illustration of how a reinforcement learning agent interacts with the environment[1]
2.2 The figure shows the difference between model-based and model-free reinforcement learning[2]
2.3 An illustration of an MDP. The agent takes an action at each step which results in changing its state in the environment and receiving a reward[2]
2.4 The key elements in a backup diagram. Redrawn from Roy[3]
2.5 The backup diagram for Dynamic Programming[4]
2.6 The backup diagram for the Monte Carlo Method[4]
2.7 The backup diagram for Temporal Difference learning[4]
2.8 The backup diagram for the Sarsa method[1]
2.9 The backup diagram for the Q-learning method[1]
2.10 Requirements for trustworthy AI
3.1 Figure of the Cliff-walking problem
3.2 The paths for the Sarsa method and the Q-learning method
3.3 The game interface for the side effects Sokoban environment
5.1 Comparison of the return of rewards for the Cliff-walking environment after training with Sarsa and Q-learning
5.2 Comparison of the episode lengths for the Cliff-walking environment after training with Sarsa and Q-learning
5.3 Comparison of the return of rewards for the side effects Sokoban environment after training with Sarsa and Q-learning
5.4 Comparison of the episode lengths for the side effects Sokoban environment after training with Sarsa and Q-learning
5.5 Return of rewards for the side effects Sokoban with and without the wall penalty
5.6 Episode lengths for the side effects Sokoban with and without the wall penalty
5.7 Comparison of the received rewards for the Cliff-walking environment and the side effects Sokoban using the Q-learning method
5.8 Comparison of the received rewards for the Cliff-walking environment and the side effects Sokoban using the Sarsa method
5.9 Comparison of the episode lengths for the Cliff-walking environment and the side effects Sokoban using the Q-learning method
5.10 Comparison of the episode lengths for the Cliff-walking environment and the side effects Sokoban using the Sarsa method
5.11 The return of rewards for the Cliff-walking environment and the side effects Sokoban with wall penalty trained with Sarsa
5.12 The episode lengths for the Cliff-walking environment and the side effects Sokoban with wall penalty trained with Sarsa


Chapter 1 Introduction

Machine learning and artificial intelligence (AI) have been of great interest for several decades, and the interest has only increased in recent years. Machine learning and AI are a great contribution to most technologies nowadays, automating several applications and tasks that were previously performed by humans. The main aim of machine learning is to understand the structure of data and fit the data into models that humans can understand and utilize[5].

In traditional computing, algorithms are sets of programmed instructions used by computers to calculate or solve problems. Machine learning algorithms, on the other hand, allow computers to train on data inputs and use statistical analysis in order to output values that fall within a specific range[5]. Thus, machine learning aids computers in building models from sample data in order to automate decision-making processes based on data input[5]. The adoption of machine learning and artificial intelligence offers many possibilities. Many repetitive tasks and decision-making tasks that were previously done by humans are now performed by autonomous agents. As a result, machine learning has reduced the workload of humans, and many companies have saved on operational costs. Another benefit of AI technologies is that they can replace humans in many jobs in hazardous environments, such as searching for landmines[6] and monitoring offshore subsea structures in the oil and gas industry[7].


1.1 Problem statement

Reinforcement learning is a subfield of machine learning and the main topic of this thesis. In reinforcement learning, the autonomous agents learn by trial and error rather than being trained on collected data. The agents are provided with an objective function. By trying to achieve their objective as best as they can, they learn their desired behaviour. The main aim of the agent is to maximize its cumulative reward. The agent interacts with the environment and learns an optimal policy over time. An agent following its policy, which is the agent’s behaviour, will at each time-step observe a state and select an action based on the policy. As a result of this action, the agent will obtain a scalar reward and also transition to the next state according to the environment dynamics[8].

In reinforcement learning, feedback is evaluative. It does not indicate whether a decision was correct, only that some actions are better than others[8]. There are several challenges when it comes to finding the best policy for the agent. The agent needs to take the best action in a given state, but also needs to consider which action yields the most reward in the long run. In a given environment, the reinforcement learning agent needs to explore in order to find possibly better actions in the future, but also to exploit its current knowledge of the state. It is necessary to find a balance between exploration and exploitation to maximize the reward and find the optimal policy. The reinforcement learning problem is formalized as a Markov Decision Process (MDP). An MDP describes a fully observable environment in reinforcement learning and defines the interaction between an agent and the environment in terms of states, actions and rewards. The goal is to control the system so that some performance criterion is maximized. An MDP is solved when the optimal policy is found.

AI is becoming more and more integrated into our everyday life and responsible for making decisions that may affect the safety of personnel, assets or the environment[9]. Having autonomous cars driving on behalf of humans is no longer a distant goal, and scientists are now closer than ever to achieving it. These autonomous systems need to learn how to avoid colliding with objects, animals and humans when driving from A to B. One of the main concerns is safety: the cars must be able to recognize harmful events and avoid the risk of causing harm. Another issue is that humans are usually biased, which may affect the design and creation of autonomous systems through the data they prefer to use. An example of this is self-driving cars and the ethical dilemma of how the car should react in the event of a fatal collision and how to design these choices[10]. Many autonomous systems that are being developed now are becoming more complex, with features that are potentially harmful if they are not extensively tested for any given situation. Therefore, it is important to ensure safe use of AI in systems[9]. Understanding the technical challenges in AI systems will lead to more successful development of useful, relevant and important systems.

While an agent is learning its behaviour, there is a risk that it takes actions with unforeseen negative side effects. Negative side effects can be a consequence of the designer specifying an incorrect or incomplete objective function[11][12]. Even though maximizing the objective function can lead to perfect learning with infinite data, the same objective function can cause harmful results.

Negative side effects can also occur when the specified objective function focuses on accomplishing a specific task in the environment but ignores other aspects of the environment. This means that the agent is indifferent to other potentially harmful events that might occur when it changes the environment[11]. An example given by Amodei et al. (2016)[11] is how to ensure that a cleaning robot, while pursuing its goals, will not disturb the environment in negative ways, such as destroying objects that stand in its way because it can clean faster by doing so. Another problem is that many autonomous systems are trained in simulated and controlled environments, which raises the challenge of how the system perceives the real-world environment. One must be assured that the autonomous system will behave the same way in the real world as in the tested environment.


1.2 Motivation and research goal

The focus of this thesis is centred around the design of autonomous agents that minimize negative side effects and prevent agents from taking undesirable actions while learning or when deployed.

The overall aim of this thesis is to study, implement and evaluate solutions for the side effect problems. To address this issue, it is necessary to have a good understanding of the reinforcement learning paradigm and its limitations. It is also necessary to explore the current state of the art of this problem. The expected outcome of this project is firstly an increased understanding of the behaviour of a reinforcement learning agent, and secondly, a possible solution to how the problem of side effects can be solved.

In order to achieve these goals, the study will compare two environments. The aim of the comparison is to see how an agent in a safety setting, where it is not directly aware of the side effects, will behave compared to an agent in a standard setting, where it is directly aware of the consequences of taking dangerous actions. This means that the agent in the safety setting is not affected in the same way as the agent in the standard setting. Contrary to the agent in the safety setting, the agent in the standard setting is aware of the penalization for dangerous actions, as it is a part of its objective function.

Based on these observations, it will be possible to gain more knowledge about why the agent chooses its actions and which factors can prevent negative side effects. The prediction for the simulations is that the agent in the safety setting will not learn the optimal behaviour as easily as the agent in the standard setting. This expectation is based on the fact that an agent is programmed to follow an objective function and learns from the feedback it receives from the environment. It will, therefore, be challenging to change the agent’s behaviour and train it to avoid side effects if the penalization for the side effect is not a part of the agent’s objective function and the agent is not directly affected by being penalized. Both agents have been trained in their environments with reinforcement learning methods. Solutions to the problem are implemented within the OpenAI Gym and GridWorld frameworks and compared with the results in the literature.

1.3 Structure

The structure of the chapters in this thesis is as follows:

Chapter 2 - Background will first go through the main elements of reinforcement learning, followed by an introduction to AI safety concepts and an overview of the related work.

Chapter 3 - Methodology will introduce and describe the algorithms and the environments used in this work, as well as the main frameworks.

Chapter 4 - Implementation will go through the implementation of the environments, the reinforcement learning methods and the frameworks that are necessary for the simulations.

Chapter 5 - Results presents the outcome from the experiments. The results from each training are followed up by an analysis of the results.

Chapter 6 - Discussion will discuss the results from the simulations and compare them to the literature and former work.

Chapter 7 - Conclusion sums up the thesis, presents the limitations of this work and suggests future work.


Chapter 2 Background

The following chapter introduces the fundamental basis which is required to understand the topic and further chapters of this thesis. Terms, definitions and concepts that are used later will be described in this chapter. First, the concept of reinforcement learning is explained, followed by an overview of different methods in reinforcement learning. Then, the importance of AI safety and avoiding negative side effects is explained. Finally, the related work of this project is presented.

2.1 Reinforcement learning

Reinforcement learning is the problem of training an agent which takes action to maximize cumulative rewards[1]. Reinforcement learning has three basic concepts: state, action and reward.

The agent is the learner and the decision-maker. The environment is everything outside the agent that the agent interacts with. The interaction between the agent and the environment can be discrete or continuous[1]. The environment is defined as discrete if it has a finite number of actions and states; otherwise it is defined as continuous. The agent selects actions, and the environment responds to these actions and presents new situations to the agent. The current situation is described by the state, and the actions are what the agent can do in each state[13]. Depending on the action the agent takes, it receives a reward from the environment. The reward is a numeric signal that passes from the environment to the agent. The goal of the agent is not only to maximize the immediate reward but also the total amount of reward it receives from the environment[1]. An agent in an environment can be a robot doing tasks such as picking up cans from an office for recycling. An action in this situation which would lead to a positive reward is for the robot to actively search for cans and collect them. The state in this situation is the position of the robot in the office.

Figure (2.1) An illustration of how a reinforcement learning agent interacts with the environment[1].
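The interaction loop between agent and environment can be sketched in a few lines of Python. This is only a minimal illustration, not the environment used in this thesis: the `CorridorEnv` class, its five cells and the reward of 1 for reaching the can are hypothetical and loosely modelled on the can-collecting robot example.

```python
class CorridorEnv:
    """Hypothetical toy environment: a robot on cells 0..4, with a can at cell 4."""

    def __init__(self):
        self.position = 0

    def reset(self):
        # Start every episode in the leftmost cell and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the robot cannot leave the corridor.
        self.position = min(4, max(0, self.position + action))
        done = self.position == 4          # the episode ends when the can is reached
        reward = 1.0 if done else 0.0      # reward signal sent back to the agent
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the agent act until the episode ends and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent selects an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


always_right = lambda state: 1
print(run_episode(CorridorEnv(), always_right))
```

The agent here is just a function from state to action. Learning methods differ in how they improve that function from the rewards they observe.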

There is no supervisor in reinforcement learning that labels what the best action is for the agent to take[14]. In supervised learning, examples are provided by a knowledgeable external supervisor and used by the agent for learning. In reinforcement learning, there is only a reward signal, and the agent must take the correct actions to obtain more reward[1]. In this type of machine learning, the agent is not told which action it should take, but must instead try the different actions and learn which ones yield the most reward. The two distinguishing features of reinforcement learning are trial and error and delayed reward[1]. Delayed reward means that actions affect not only the immediate reward but also the situation that follows and, through that, all subsequent rewards. Reinforcement learning depends on a series of actions, meaning that the actions the agent takes, and the order in which it takes them, decide the final reward outcome[14].


2.1.1 Sub-elements of Reinforcement Learning

There are four additional sub-elements to a reinforcement learning system apart from the agent and the environment: a reward function, a policy, a value function and a model of the environment. The last sub-element is optional.

The reward function

In a reinforcement learning problem, the reward function defines the goal[1]. The agent gets a reward, R, for every step. The reward is a scalar value and is presumed to be a function of the state[15]. The agent’s only goal is to maximize the total reward it receives in the long run[1]. The reward function is fixed and defines what are good and bad events for the agent.

If the policy leads the agent to an action that results in a low reward, the policy can be changed so that the agent selects another action[1]. The agent must exploit the knowledge it already has to gain rewards, but also explore to make better action selections in the future[1]. The agent needs to explore by trying untried actions or actions whose outcome it is uncertain about. By doing this, it can gain information about the rewards and the behaviour of the system[15]. If the agent only exploits its knowledge, the actions it takes are called greedy. If the agent selects one of the non-greedy actions, it is exploring, because this enables it to improve its estimate of the non-greedy action’s value[1]. In the long run, exploration will lead to a greater sum of reward, while exploitation will maximize the expected reward in one play. The conflict between exploration and exploitation arises because it is not possible to do both with any single choice of action. For most reinforcement learning problems it is important to both explore and exploit[16].
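One common way to balance exploration and exploitation is ε-greedy action selection. The sketch below is an illustration under that assumption. The text above does not name a specific exploration scheme, and the function name `epsilon_greedy` and the parameter `epsilon` are chosen here for the example.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Return the index of the chosen action.

    action_values: list of current value estimates, one per action.
    epsilon: probability of exploring (assumed parameter, between 0 and 1).
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(action_values))
    # Exploit: pick the greedy action (the first one if several are tied).
    return action_values.index(max(action_values))


estimates = [0.1, 0.9, 0.3]
print(epsilon_greedy(estimates, epsilon=0.0))  # always greedy
print(epsilon_greedy(estimates, epsilon=0.1))  # greedy most of the time
```

With `epsilon=0.0` the agent only exploits. Larger values make it try non-greedy actions more often, which improves its estimates of their values.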

Policy

The learning agent’s way of behaving at a given time is defined by a policy. One can define a policy as the agent’s behaviour function[16]. A function that generates actions, a, based on the state, s, is called a policy, π [15]. Finding a policy that optimizes the continuing sum of rewards R(s, a) is a reinforcement learning problem. The policy can be either deterministic or probabilistic. If the policy is deterministic, the agent will take the same action every time it is in a given state, in the form a = π(s). The policy is probabilistic if the agent, on encountering a state, draws a sample from a distribution over actions, a ∼ π(s, a) = P(a|s) [15]. To be able to discover the relation between states, actions and rewards, the agent needs to explore. Exploration can either be directly embedded into the policy or performed separately as a part of the learning process[15].
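The difference between the two kinds of policy can be made concrete with a small sketch. The state names, action names and probabilities below are hypothetical and only serve to illustrate a = π(s) and a ∼ π(s, a).

```python
import random

# Deterministic policy: a lookup table from state to action, a = pi(s).
deterministic_policy = {"start": "right", "corridor": "right", "ledge": "up"}

# Probabilistic policy: for each state, a distribution over actions, pi(s, a) = P(a | s).
probabilistic_policy = {
    "start": {"right": 0.9, "up": 0.1},
    "corridor": {"right": 0.8, "left": 0.2},
    "ledge": {"up": 1.0},
}


def act_deterministic(state):
    # The same state always yields the same action.
    return deterministic_policy[state]


def act_probabilistic(state, rng=random):
    # Draw one action according to the probabilities stored for this state.
    distribution = probabilistic_policy[state]
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(act_deterministic("start"))
print(act_probabilistic("start"))
```

Embedding exploration directly into the policy corresponds to the probabilistic case: every action with non-zero probability is tried from time to time.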

The value function

The value function in reinforcement learning specifies what is beneficial in the long run, while the reward function indicates what is beneficial in an immediate sense[1]. The value of a state is the sum of rewards an agent can expect to gain over the future starting from that state. The only purpose of estimating values is to attain more rewards, as without rewards there could be no value. When making and evaluating decisions, however, values are what concern us most. Actions should be chosen for the highest value rather than the highest immediate reward, because such actions obtain the greatest amount of reward over the long run. A method for efficiently estimating values is, therefore, the most important component of almost all reinforcement learning algorithms[1].

Model of the environment

The last, but also optional, component in a reinforcement learning system is the model of the environment. Model-based reinforcement learning uses experience to create an internal model of the transitions and immediate outcomes in the environment[17]. The model mimics the behaviour of the environment and is used for planning, that is, for deciding on a course of action by taking possible future situations into account before they are experienced[1]. Model-free approaches can achieve the same optimal behaviour without estimating or using a model. They use experience to directly learn one or both of two simpler quantities, state-action values or policies[17].

As shown in figure 2.2, the model-based reinforcement learning method makes a model of the environment and uses it for planning, while the model-free method learns the state-action values or the policy directly from experience.

Figure (2.2) The figure shows the difference between model-based and model-free reinforcement learning [2]


2.1.2 Markov Decision Process

A Markov Decision Process (MDP) describes an environment that is fully observable in reinforcement learning[1]. MDPs formalize the problem of reinforcement learning and define the interaction between agent and environment in terms of states, actions, and rewards[1]. An environment is modelled as a set of states, together with actions that can be performed to control the state of the system. The goal is to control the system so that some performance criterion is maximized[18].

The state is the information that is available to the agent. The state signal should include immediate sensations but can also contain more[1]. A state signal that retains all relevant information has the Markov property. Formally, a state $S_t$ is Markov if and only if[19]:

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$$

Once the current state is known to the agent, the rest of the history can be discarded[19].

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is a finite set of actions. $\mathcal{P}$ is the state transition probability matrix, defined as[19]:

$$\mathcal{P}^{a}_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$$

The state transition matrix $\mathcal{P}$ defines the probability of making a transition to a state $s'$, given any state $s$ and action $a$ [1]. $\mathcal{R}$ is the reward function, which gives the expected value of the next reward given any current state $s$ and action $a$ [1]:

$$\mathcal{R}^{a}_{s} = E[R_{t+1} \mid S_t = s, A_t = a]$$

Both the state transition matrix and the reward function are quantities that specify the most important aspects of the dynamics in a finite MDP. A finite MDP is defined by the one-step dynamics of the environment and its state and action sets [1]. $\gamma$ is a discount factor [19]:

$$\gamma \in [0, 1]$$


The discount factor determines the present value of future rewards and is used for several reasons[19]. One reason is that discounted rewards are mathematically convenient. Another is that discounting avoids infinite returns in cyclic Markov processes. A third is that uncertainty about the future may not be fully represented without a discount factor[19].
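The effect of the discount factor can be seen by computing a discounted return, that is, the sum of rewards in which the reward received k steps into the future is weighted by γ to the power k. The helper name `discounted_return` and the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma ** k."""
    total = 0.0
    # Working backwards turns the sum into the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # undiscounted: every reward counts fully
print(discounted_return(rewards, gamma=0.5))  # later rewards count less
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts
```

With γ below 1 the weights shrink geometrically, which is why the return stays finite even when a process cycles forever with bounded rewards.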

Another important feature of MDPs is the Bellman equations. The Bellman equations let us solve MDPs and are necessary to understand how reinforcement learning algorithms work[20]. Finding the optimal policy and the optimal value function amounts to full optimization and makes control in an MDP possible, which is necessary to solve the MDP[21]. Solving an MDP means finding an optimal policy that gains maximal reward over the long run[1].

The Bellman equation for the state value in a Markov reward process is defined as[19]:

$$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

The Bellman equations let us express the values of states in terms of the values of other states[20]. If we know the value of $S_{t+1}$, the value of $S_t$ can easily be calculated. This opens up iterative approaches for calculating the value of each state, including the current one[20]. Given $v(s)$, one can easily determine an optimal policy[1] and thereby solve an MDP.
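The iterative approach mentioned above can be sketched for a small Markov reward process by applying the Bellman equation repeatedly as an update rule. The two-state process, the function name `evaluate_mrp` and the number of sweeps are assumptions made for this example.

```python
def evaluate_mrp(transition, reward, gamma, sweeps=200):
    """Estimate v(s) by repeatedly applying
    v(s) <- reward[s] + gamma * sum over s' of transition[s][s'] * v(s').

    transition[s][s2] is the probability of moving from state s to state s2.
    reward[s] is the expected next reward when leaving state s.
    """
    n = len(reward)
    values = [0.0] * n
    for _ in range(sweeps):
        values = [
            reward[s] + gamma * sum(transition[s][s2] * values[s2] for s2 in range(n))
            for s in range(n)
        ]
    return values


# State 0 yields reward 1 and stays put or moves to state 1 with equal probability.
# State 1 is absorbing and yields no reward.
transition = [[0.5, 0.5], [0.0, 1.0]]
reward = [1.0, 0.0]
values = evaluate_mrp(transition, reward, gamma=0.9)
print(values)
```

For this process the Bellman equation can also be solved by hand: v(0) = 1 + 0.9 · 0.5 · v(0), so v(0) = 1 / 0.55, which the iteration approaches.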

Figure (2.3) An illustration of an MDP. The agent takes an action at each step which results in changing its state in the environment and receiving a reward[2]


2.1.3 Reinforcement learning methods

This subsection presents the main reinforcement learning methods. Backup diagrams are used as a graphical representation of algorithms and show the difference between reinforcement learning methods. A backup process represents states, actions, state transitions and rewards[3]. In a backup, value information (a state value or a state-action value) is transferred back to a state or state-action pair from its successor states or state-action pairs.

Figure 2.4 shows the different symbols that are present in a backup diagram. The hollow circle represents state value. The state-action value or the action value are represented by a solid circle. Furthermore, action is represented by an arrow starting from a state and the reward is conventionally shown after the action value. The arc starting from a state represents the action that results in maximum action value.

Figure (2.4) The key elements in a backup diagram. Redrawn from Roy[3]


2.1.4 Dynamic Programming

Dynamic programming can be used to compute optimal policies given a perfect model of the environment as a Markov decision process[1]. It refers to a collection of algorithms[1] and is a method for solving complex problems[22]. The complex problems are broken down into subproblems, which are solved individually, and the solutions to the subproblems are combined to solve the overall problem[22]. Dynamic programming is used for planning in a Markov decision process and assumes full knowledge of the MDP.

Figure (2.5) The backup diagram for Dynamic Programming[4]

Figure 2.5 shows the backup diagram for dynamic programming. The figure shows that in dynamic programming, all possible actions are evaluated before updating the value function for the state $S_t$.
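A full backup over all actions can be sketched with value iteration, which is one standard dynamic programming algorithm and is used here purely as an illustration. The tiny MDP, its state and action names, and the function name `value_iteration` are hypothetical.

```python
def value_iteration(transitions, rewards, gamma, sweeps=100):
    """Full-backup value iteration on a known MDP.

    transitions[s][a] is a list of (probability, next_state) pairs.
    rewards[s][a] is the expected reward for taking action a in state s.
    States with no actions are treated as terminal with value 0.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        new_values = {}
        for s, actions in transitions.items():
            if not actions:
                new_values[s] = 0.0
                continue
            # Evaluate every available action before updating the value of s.
            new_values[s] = max(
                rewards[s][a] + gamma * sum(p * values[s2] for p, s2 in outcomes)
                for a, outcomes in actions.items()
            )
        values = new_values
    return values


transitions = {
    "start": {
        "safe": [(1.0, "end")],
        "risky": [(0.5, "end"), (0.5, "start")],
    },
    "end": {},
}
rewards = {"start": {"safe": -3.0, "risky": -1.0}}
values = value_iteration(transitions, rewards, gamma=1.0)
print(values)
```

The update needs the transition probabilities and expected rewards of every action, which is exactly the full knowledge of the MDP that dynamic programming assumes.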


2.1.5 Monte Carlo method

The Monte Carlo method does not assume full knowledge of the environment but requires only experience[1]. The experience consists of sample sequences of states, actions and rewards, obtained from either actual or simulated interaction with an environment. The reinforcement learning problem is solved by averaging sample returns. The Monte Carlo method is defined only for episodic tasks, to ensure that well-defined returns are available[1]. Episodic tasks are series of separate episodes, where each episode consists of a finite sequence of time steps[1]. It is assumed that experience is divided into episodes and that all episodes eventually terminate regardless of which actions are chosen. Only after the completion of an episode are the value estimates and policies changed. This method is therefore not incremental in a step-by-step sense but in an episode-by-episode sense[1].

Figure (2.6) The backup diagram for the Monte Carlo Method[4]

Figure 2.6 shows the backup diagram for the Monte Carlo method. As seen in the figure, the Monte Carlo method completes an entire trajectory and reaches the terminal state before updating the value function for the state $S_t$.
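Averaging sample returns over complete episodes can be sketched as follows. The example uses the every-visit variant of Monte Carlo prediction, which is an assumption made here for brevity. The episode format and the name `mc_state_values` are likewise hypothetical.

```python
from collections import defaultdict


def mc_state_values(episodes, gamma):
    """Every-visit Monte Carlo prediction.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state. Values are the average of the
    returns observed after each visit to a state.
    """
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Returns are only known once the episode has finished, so walk it backwards.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            return_sums[state] += g
            visit_counts[state] += 1
    return {s: return_sums[s] / visit_counts[s] for s in return_sums}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
print(mc_state_values(episodes, gamma=0.5))
```

No transition probabilities appear anywhere in the code. The estimates are built from experienced episodes alone.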


2.1.6 Temporal Differences

Temporal difference (TD) learning is a method that combines ideas from Monte Carlo and dynamic programming[1]. In the same way as Monte Carlo methods, TD learning can learn directly from experience without a model of the environment’s dynamics. Like dynamic programming, TD methods also bootstrap, which means that the estimates are updated based partly on other learned estimates, without waiting for a final outcome[1].

Figure (2.7) The backup diagram for Temporal Difference learning[4]

The backup diagram for temporal difference learning in figure 2.7 shows that, unlike in the Monte Carlo method, the value function for the state $S_t$ is updated before reaching the terminal state T.
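The simplest TD method, TD(0), updates a state value after a single step using the estimate of the next state. The sketch below assumes a step size `alpha`, which is not introduced in the text above, and uses hypothetical state names.

```python
def td0_update(values, state, reward, next_state, alpha, gamma, terminal=False):
    """One TD(0) update: V(S) <- V(S) + alpha * (R + gamma * V(S') - V(S)).

    values: dict mapping states to current value estimates (modified in place).
    terminal: if True, the next state is terminal and contributes no value.
    """
    bootstrap = 0.0 if terminal else gamma * values[next_state]
    td_target = reward + bootstrap  # built from another learned estimate
    values[state] += alpha * (td_target - values[state])
    return values[state]


values = {"A": 0.0, "B": 1.0}
# A single observed transition A -> B with reward 0 is enough to update V(A).
print(td0_update(values, "A", reward=0.0, next_state="B", alpha=0.5, gamma=1.0))
```

The update happens immediately after one transition, without waiting for the episode to end, which is what distinguishes it from the Monte Carlo update.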


2.1.6.1 Sarsa

The Sarsa algorithm is an algorithm for TD-learning. Sarsa uses only one policy and new actions and rewards are selected using the same policy that determined the original action[23]. The name of the Sarsa algorithm has its origin from the fact that the updates are done using the quintuple Q(s, a, r, s, a0). s,a are the original state and action and the r is the reward observed in that state. s0,a0 are the new state-action pair[23].

Sarsa is an on-policy method, meaning that it learns the value of the policy that is used to make decisions. The results of executed actions, which are determined by that policy, are used to update the value function. Such policies are usually non-deterministic and soft, which ensures that there is an element of exploration in the policy[23]. On-policy methods differ from off-policy methods in that they attempt to evaluate and improve the same policy that they use to make decisions[1].

Figure (2.8) The backup diagram for the Sarsa method[1]

Figure 2.8 illustrates the backup diagram for the Sarsa method. The top node, which is the root of the backup, represents the state-action pair. At state S, an action A is performed which yields a reward R and a transition to the state S'. After the transition, the action A' is performed according to the policy and the state-action pair is updated.


2.1.6.2 Q-learning

Another learning method for agents is Q-learning. This is a form of model-free reinforcement learning[24]. In Q-learning the agent learns to act optimally in Markovian domains by experiencing the effects of actions, without the need to build models of the domains[24]. The agent learns by trying an action in a particular state and evaluating the outcome in terms of the immediate reward or penalty it receives and its estimate of the value of the state to which it is taken[24]. The agent learns which actions are best overall by trying all actions in all states repeatedly, considering the long-term discounted reward.

Q-learning is an off-policy method, meaning that unlike on-policy methods which only have one policy, this method has two policies. The behaviour policy is used to generate behaviour. The other policy, the estimation policy, is evaluated and improved and may be unrelated to the behaviour policy[1]. In contrast to on-policy algorithms, off-policy algorithms can separate exploration from control. Therefore, an agent that is trained using an off-policy method may learn tactics that it did not necessarily exhibit during the learning phase.

Figure (2.9) The backup diagram for the Q-learning method[1]

Figure 2.9 shows the backup diagram for the Q-learning method, which starts in the same way as the diagram for the Sarsa method. The top node is the state-action pair. At state S, an action A is performed which yields a reward R and a transition to the state S'. The bottom nodes represent all the possible actions in the next state. The arc in the figure indicates that the maximum over these actions is taken before the state-action pair is updated[1].

2.2 AI Safety

The International Organization for Standardization (ISO) has defined safety as “freedom from risk which is not tolerable”[25]. The definition from ISO implies that scenarios with non-allowable consequences should have adequately low probability or frequency of occurring in safe systems [9]. AI and machine learning algorithms depend on relevant observations to predict the outcome of future scenarios accurately. At the same time, autonomous systems need to make safety-critical decisions in real-time. Designers of such systems are therefore responsible for designing artificially intelligent systems that are safe.

2.2.1 The EU Commission and AI Safety

In June 2018, the European Commission set up an independent expert group who prepared the document “The Ethics Guidelines for Trustworthy Artificial Intelligence (AI)”[26]. The Guidelines are based on fundamental rights and ethical principles and the document lists seven key requirements that AI systems should meet in order to be trustworthy. Figure 2.10 shows the interrelationship of the seven requirements for a trustworthy AI.

According to the expert group, all of the seven requirements are of equal importance, support each other and should be implemented and evaluated throughout the AI system’s lifecycle[26].


Figure (2.10) Requirements for trustworthy AI

The expert group claimed that technical robustness, which is closely linked to the principle of prevention of harm, is a crucial component of achieving trustworthy AI. For an AI system to be technically robust, it has to be developed with a preventive approach to risks. The system should reliably behave as intended while minimizing unintentional and unexpected harm, and preventing unacceptable harm[26]. Technical robustness also applies to potential changes in the AI system’s operating environment and to the presence of other agents, both human and artificial, that may interact with the system in an adversarial manner.

The expert group has added accuracy as an important feature when developing robust AI systems. Accuracy refers to the system’s ability to make correct judgements, such as classifying information into proper categories or making correct predictions, recommendations, or decisions based on data or models. When occasional inaccurate predictions are unavoidable, the system should indicate how likely these errors are. In situations where the AI system directly affects human lives, a high level of accuracy is crucial[26].

The results of AI systems need to be both reliable and reproducible. By being reproducible the AI system presents the same behaviour when repeated under the same conditions, making it possible to accurately describe what the AI systems do. Having a reliable AI system means having a system that works accurately with a range of inputs and in a range of situations[26]. A reliable system is needed to prevent unintended harms.

Furthermore, the expert group argued that there should be established processes to clarify and assess potential risks associated with the use of AI systems across various application areas. The level of safety measures required depends on the magnitude of the risks posed by an AI system. If the risks are high, it is crucial that safety measures are developed and tested proactively[26].


2.2.2 Compromising AI Safety

There are multiple procedures that can risk the safety of an AI system and leave it vulnerable and compromised. The list below considers the safety problems that can occur when designing an AI system[27].

• Safe interruptibility: Being able to interrupt an agent and override its actions at any time. An agent should have a design so that it does not seek or avoid interruptions.

• Absent supervisor: Making sure that an agent does not behave differently depending on the presence or absence of a supervisor.

• Reward gaming: Building an agent that does not try to introduce or exploit errors in the reward function in order to get more reward.

• Self-modification: Designing an agent that behaves well in environments that allow self-modification.

• Distributional shift: Ensuring that an agent behaves robustly when its test environment differs from the training environment.

• Robustness to adversaries: Designing an agent that can detect and adapt to friendly and adversarial intentions present in the environment.

• Safe exploration: Building an agent that respects safety constraints during normal operation as well as during the initial learning period.

• Avoiding side effects: Get agents to minimize effects unrelated to their main objectives, especially those that are difficult to reverse or completely irreversible.

To narrow the scope within AI safety, this work focuses mainly on how to avoid side effects. This is done through simulations in environments that consider safety hazards, rewards and penalties.


2.2.3 Negative side effects

An autonomous agent that is designed to fulfill tasks on behalf of humans will have an objective function that the agent wants to maximize. In reinforcement learning, maximizing the objective function means obtaining higher cumulative rewards. The main focus when designing an AI system is to have an agent that follows the objective function and learns what it is created for. If the objective function considers nothing in the environment beyond fulfilling the task, the agent’s actions can lead to harmful events.

Amodei et al. [11] have identified three reasons why side effects may occur.

1. The designer of the system may have specified the wrong formal objective function. This means that maximizing the objective function causes harmful results, even with perfect learning and infinite data. The objective function might be specified only to accomplish some specific task in the environment while ignoring other aspects of the environment. Ignoring other aspects causes indifference over environmental variables that might be harmful to change.

2. The designer may know the correct objective function or have a method of evaluating the objective function. However, it is too costly to do so regularly. This could lead to possible harmful behaviour caused by bad extrapolations from finite samples.

3. The correct formal objective function may have been specified by the designer, causing correct behaviour, but something bad happens due to making decisions from insufficient or poorly curated training data or an insufficiently expressive model.

Negative side effects may also occur when the designers of the system have not considered different harmful scenarios or possible risks associated with the system in different environments. The agent can also “overfit” to the simulated environment, causing the agent to stop progressing[27]. This can cause the agent to behave perfectly in the simulated environment, but not be able to avoid actions that can cause harm in the real world. Negative side effects may be a result of an incorrect or incomplete specification of the objective function[12]. An example of this is a cleaning robot that breaks an object that is blocking its way while cleaning. The robot has been designed to clean but may not consider all objects that stand in the way of achieving its goal. Another example is a self-driving car that drives its owner from A to B without considering the other cars on the road or people crossing the streets. This can lead to fatal consequences if the system is indifferent to objects or humans and only wants to achieve its goal and maximize its cumulative reward. Because of such consequences, AI systems that are developed to work in our everyday life must be designed to be safe and robust.

When designing the autonomous system, there are many different factors the designer must consider. One of the factors may be that the autonomous agent should contribute in a positive way and not break objects or, most importantly, cause harm to human beings. It is also important that the agent makes minimal changes to its surroundings and only does what it is designed for, leaving the rest of the environment unaffected. Designing a safe AI system is crucial to maintain a safe environment with no or minimal uncontrollable events occurring.

Negative side effects can be restricted with constraints. This can be done by specifying and integrating constraints in the objective function of the AI system. Examples of integrated constraints for the cleaning robot and the self-driving car are to avoid destroying objects and to avoid harming pedestrians.

Although these constraints will limit the number of side effects occurring, they will not eliminate the problem of side effects, as it is impossible to enumerate and foresee all the possibilities that should be constrained.

In a dynamic environment, there will be events happening that might not have been specified in the system’s objective function and which will cause negative side effects. Therefore, the goal is to have an agent that can limit the number of side effects occurring without explicitly integrating the constraints in its objective function. This way, the agent will have a general knowledge of how to avoid causing side effects without it being a part of its main objectives.


2.3 Related work

With an increased number of autonomous systems using machine learning and AI, several papers have been published to shed light on the possible side effects of using these types of technologies. The presented papers discuss and focus on how to design autonomous agents that can act safely in a given environment.

Amodei et al. (2016) [11] present different approaches to avoid side effects that may arise while training the agent. The authors believe that side effects occur because of a wrong objective function, a lack of care in the learning process, or other implementation errors that cause harmful and unintended behaviour from the system. One of the approaches discussed in the paper for minimizing side effects is to penalize the autonomous agent if it changes the environment. If the agent were penalized for each change to the environment, it would eventually be neutralized, as every action impacts the environment. Therefore, the authors believe one needs to define an impact regularizer[28]. This will lead the agent to prefer ways of achieving its goals with minimal side effects, or give it a finite “budget” of impact. The authors believe the challenge here is to formalize what “change to the environment” is, as the agent interacts with the environment continuously and thereby impacts it.
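The idea of an impact regularizer can be sketched as a task reward minus a weighted impact term. The impact measure below (the number of grid cells that differ from a baseline state), the function name `regularized_reward` and the example grids are assumptions made only for illustration; they are not a measure proposed in [11].

```python
def regularized_reward(task_reward, baseline_state, current_state, weight=1.0):
    """Task reward minus a weighted impact penalty.

    Impact is counted as the number of cells that differ between a baseline
    grid and the current grid (an illustrative choice only).
    """
    impact = sum(
        1
        for base_row, cur_row in zip(baseline_state, current_state)
        for base_cell, cur_cell in zip(base_row, cur_row)
        if base_cell != cur_cell
    )
    return task_reward - weight * impact


# Made-up 2x4 grids: the agent A pushes the box X one cell to the right.
baseline = ["A X ", "   G"]
after_push = ["  AX", "   G"]
penalized = regularized_reward(10.0, baseline, after_push, weight=0.5)
```

In this naive measure the agent's own movement already counts as impact, which is an instance of the difficulty of formalizing “change to the environment”.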

Another approach is to learn an impact regularizer through training over many tasks. The agent would recognize harmful side effects and avoid the actions that cause side effects[28]. The agent will be trained for both the original task that is specified by the objective function and another task that can recognize the side effects. Even though the two tasks have different objective functions or operate in different environments, they might have similar side effects[28]. This can lead to transfer learning when holding back on one task component but not the other[11]. Hence, if the agent learns to avoid side effects on one task, it can transfer this knowledge to another task.

The authors conclude that these approaches can limit side effects and prevent, or at least bound, incidental harm. However, they are not a sufficient replacement for extensive testing[11] and critical evaluation before deployment in real-life settings[28].


Leike et al. (2017) [27] present a series of reinforcement learning environments showing different safety properties of intelligent agents. The aim of the paper is to lay the groundwork for a broad environment suite for AI safety problems and to make the discussion around technical problems in AI safety more concrete.

One of the environments illustrates the problem of avoiding side effects. For this environment, the authors have chosen to use a reward function and a performance function. The performance function is used for safety and can be thought of as a second reward function hidden from the agent. This function assesses the performance according to what the authors want the agent to do. The nominal reinforcement signal is termed the reward function and is observed by the agent. This environment is a Markov decision process.

The authors use an irreversible side effects environment where the reward function encourages the agent to get to a goal, while the performance function measures how well the agent has performed with regard to avoiding side effects.

The authors emphasize the importance of having an agent that fulfills its main objectives, yet avoids situations that can lead to potential harm, without all safety constraints being specified. Their simulations conclude that the agent reaches the goal without considering actions that can prevent the side effects in the environment.

The paper written by Krakovna et al. (2019) [29] focuses on how to design safe reinforcement learning agents that avoid unnecessary disruptions to the environment they are in. The paper also discusses how one can measure side effects in a general way that is not specific to a particular environment and task, as well as how to incentivize agents to avoid them[12]. The authors believe penalizing the side effects can lead to bad incentives[29]. By bad incentives, they mean preventing irreversible changes in the environment, including the actions of other agents[29].

They suggest a combination of two approaches: preserving reversibility and penalizing impact. With preserving reversibility, the agent is encouraged to prevent irreversible events. These two approaches are compared to the default outcome using a reachability-based measure[12]. The authors have taken into consideration cases where the objective requires irreversible actions, such as breaking an egg to make an omelette. Penalizing the agent in this case will only lead the agent to stop avoiding further irreversible actions. To overcome this, the reachability of all states is considered instead of only the default state. Since each irreversible action cuts off more of the state space (e.g. breaking an object makes all the states where the object was intact unreachable), the penalty increases accordingly[12]. This measure is called “relative reachability”.
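The intuition that each irreversible action cuts off part of the state space can be illustrated with a simplified, undiscounted sketch: the penalty is the fraction of states that can be reached from a baseline state but no longer from the current state. This is a binary simplification and not the exact measure from [12]; the state graph and all names are hypothetical.

```python
def reachable(graph, start):
    """Return the set of states reachable from `start` in a directed graph."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reachability_penalty(graph, baseline_state, current_state):
    """Fraction of all states reachable from the baseline but not from the current state."""
    lost = reachable(graph, baseline_state) - reachable(graph, current_state)
    return len(lost) / len(graph)


# Hypothetical state graph: an object can be moved back and forth,
# but once it is broken it stays broken.
graph = {
    "object_left": ["object_right", "object_broken"],
    "object_right": ["object_left", "object_broken"],
    "object_broken": [],
}
penalty_move = reachability_penalty(graph, "object_left", "object_right")
penalty_break = reachability_penalty(graph, "object_left", "object_broken")
```

Moving the object is reversible and costs nothing under this measure, whereas breaking it makes two of the three states unreachable.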

As a proof of concept, experiments with a tabular Q-learning agent in the AI Safety GridWorlds framework were conducted. Relative reachability showed improvements over existing approaches, but the authors emphasized that the relative reachability definition in its current form is not easy to apply in realistic environments. The reason is that there are too many possible states to consider when the training begins. The agent is not aware of all the states, and it can be difficult to define and simulate the default outcome[12].


Chapter 3

Methodology

To address the stated research goals of this project, the differences in the autonomous agent’s behaviour will be studied based on two different algorithms and two different environments. The design choices made for the environments and the procedures of the training methods have a great impact on the agent’s behaviour. This chapter discusses the decisions made with regard to the chosen algorithms, environments and supporting technologies. The chapter explains the chosen algorithms, Sarsa and Q-learning, in detail. The two environments used for simulations, the Cliff-walking problem and the side effects Sokoban, are also reviewed and explained. Lastly, the main frameworks that have been useful for implementing the environments and the training methods are briefly introduced.

3.0.1 Algorithms

The algorithms used for the training of the agent in the Cliff-walking environment and in the side effects Sokoban environment are presented here.

These two algorithms have a lot of similarities, but the main difference here is the update of the Q-value.

The Q-value is updated using the following parameters[23]:

• α - The learning rate. A numerical value between 0 and 1. A value of 0 signifies that the Q-values are never updated, meaning nothing is learned. A high value indicates that learning may occur quickly, but will also lead to unstable evaluations.

• γ - The discount factor. A numerical value between 0 and 1. As mentioned previously in chapter 2, the discount factor will weight future rewards less than immediate rewards.

• max_a' - The maximum value that is possible to obtain in the state after the current one, that is, the estimated Q-value of taking the best action from that state on. This parameter is only used by the Q-learning algorithm and not by the Sarsa algorithm.

3.0.1.1 The Sarsa algorithm

Algorithm 1 Sarsa

1: Initialize Q(s, a) arbitrarily
2: repeat (for each episode):
3:     Initialize s
4:     Choose a from s using a policy derived from Q (e.g. ε-greedy)
5:     repeat (for each step of episode):
6:         Take action a, observe r, s'
7:         Choose a' from s' using a policy derived from Q (e.g. ε-greedy)
8:         Q(s, a) ← Q(s, a) + α[r + γQ(s', a') − Q(s, a)]
9:         s ← s'; a ← a'
10:     until s is terminal

Algorithm 1 shows the steps in the Sarsa method. The first step in the algorithm is to initialize the Q-value table, Q(s, a), which holds one value for each state-action pair.

The next step in the algorithm is a loop which is repeated for each episode. The number of episodes can be determined by the user. In this loop, the first step is to observe the current state, s. After observing s, an action, a, is chosen based on an action selection policy, for example the ε-greedy policy. Inside this loop there is a second loop, which repeats for each step of the episode. The number of iterations equals the number of steps the agent must take to end the episode. In this project, an episode ends when the agent reaches the goal in the side effects Sokoban environment, or when it reaches the goal or falls off the cliff in the Cliff-walking task. In each step, the agent takes the action a and observes the reward, r, and the new state, s'. After that, a new action, a', is selected from the state s' using the same policy that determined the original action. The Q-value is then updated using the observed reward and the value of the new state-action pair. Lastly, to move on to the next state-action pair, the state s is set to s' and the action a is set to a'. This is repeated until a terminal state is reached.
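The steps of Algorithm 1 can be sketched in Python as follows. The one-dimensional corridor environment (`corridor_step`), the parameter values and the helper names are assumptions chosen to keep the example self-contained; they are not the environments or settings used in the experiments.

```python
import random


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = Q[state]
    return values.index(max(values))


def corridor_step(state, action, goal=4):
    """Hypothetical corridor: action 0 moves left, action 1 moves right."""
    next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
    return next_state, -1.0, next_state == goal


def sarsa(n_states=5, n_actions=2, episodes=500, alpha=0.5, gamma=1.0,
          epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]  # 1: initialize Q(s, a)
    for _ in range(episodes):                         # 2: for each episode
        s = 0                                         # 3: initialize s
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)  # 4: choose a
        done = False
        while not done:                               # 5: for each step
            s2, r, done = corridor_step(s, a)         # 6: take a, observe r, s'
            a2 = epsilon_greedy(Q, s2, n_actions, epsilon, rng)  # 7: choose a'
            # 8: update towards r + gamma * Q(s', a'); terminal states have value 0
            target = r if done else r + gamma * Q[s2][a2]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2                             # 9: s <- s', a <- a'
    return Q


Q = sarsa()
```

In this corridor, the value of moving right from the cell next to the goal converges to the step reward of -1, and the row of the terminal state is never updated.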

3.0.1.2 The Q-learning algorithm

Algorithm 2 Q-learning

1: Initialize Q(s, a) arbitrarily
2: repeat (for each episode):
3:     Initialize s
4:     repeat (for each step of episode):
5:         Choose a from s using a policy derived from Q (e.g. ε-greedy)
6:         Take action a, observe r, s'
7:         Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
8:         s ← s'
9:     until s is terminal

Algorithm 2 is similar to the Sarsa algorithm. The main difference between the algorithms is that the Q-values in the Q-learning method are updated using the maximum Q-value of the next state.

First, the Q-value table, Q(s, a), is initialized. The next step is a loop which is repeated for each episode. At the start of each episode, the current state, s, is initialized. After the initialization of s, a new loop starts for each step of the episode. The first step in this loop is to choose the action, a, from the observed state, s, using a policy derived from Q. The action is then taken, and the reward, r, and the new state, s', are observed. In the next step, the Q-value for the state-action pair is updated using the observed reward and the maximum Q-value of the next state. Lastly, the state s is set to the new state s'. This process is repeated until a terminal state is reached.

The Q-learning algorithm, Algorithm 2, differs from the Sarsa algorithm in that its update uses the action with the maximum estimated value in the next state, s', instead of the action selected by its policy. The fact that the Q-learning method can learn an optimal policy without following that policy makes this method off-policy.
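The difference between the two update rules can be seen by applying both to the same transition. In this minimal sketch, the function names and the Q-values are made up for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update: bootstrap from the best action value in s_next."""
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa update: bootstrap from the action actually selected in s_next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
    return Q[s][a]


# Two states with two actions each; the values are made up.
Q_off = {"s": [0.0, 0.0], "s_next": [-10.0, -2.0]}
Q_on = {"s": [0.0, 0.0], "s_next": [-10.0, -2.0]}

# Same transition, but the behaviour policy explores and picks action 0 in s_next.
q_value = q_learning_update(Q_off, "s", 1, r=-1.0, s_next="s_next",
                            alpha=0.5, gamma=1.0)
sarsa_value = sarsa_update(Q_on, "s", 1, r=-1.0, s_next="s_next", a_next=0,
                           alpha=0.5, gamma=1.0)
```

Q-learning ignores the exploratory action and uses the value -2, while Sarsa uses the value -10 of the action that was actually chosen, which gives a lower estimate for the same transition.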


3.1 Reinforcement learning in a standard setting: The Cliff-walking environment

The Cliff-walking environment introduces a task where the agent must move along the cliff edge without falling off the cliff. Figure 3.1 shows a 12x4 GridWorld, where the ten blue squares on the bottom row represent the cliff. The green cell on the left side is the starting point and the red cell on the right side is the goal. The main aim for the agent is to get from the starting point to the goal without stepping into the region marked as “The Cliff”. Although this environment may not have been specifically designed with AI safety as a concern, it does underline the importance of having an agent that behaves safely. The agent is trained both to go along the edge of the cliff and to take a longer route to reach the goal while avoiding falling off the cliff.

The Cliff-walking task is an episodic task, where each episode ends either by reaching the goal or falling off the cliff. The possible actions are left, right, up and down. All transitions give a reward of -1 except for transitions into the cliff region, where the agent receives a reward of -100 and is sent back to the start. The maximum reward possible to obtain from the environment when training with Sarsa is -17. The maximum reward possible to obtain from training with Q-learning is -13. During training, the ε of the ε-greedy policy is set to 0.1. This means that the exploration rate is 10%, which is the share of steps where the agent takes random actions. The remaining 90% is the exploitation rate, where the agent follows the policy when it needs to take an action[1].

Figure (3.1) Figure of the Cliff-walking problem


Figure 3.2 shows the optimal path for the Q-learning method and the safe path for the Sarsa method. The figure also shows that falling off the cliff will send the agent back to the starting point. While the Q-learning method has an optimal path in the region right above the cliff, the Sarsa method takes a longer and a safer path to reach the goal. The agent trained with the Sarsa method must take 17 steps from where it starts to reach the goal while the agent trained with the Q-learning method only needs to take 13 steps to reach the goal.

The different paths the agent can take, which depend on the training method, highlight the difference between on-policy (Sarsa) and off-policy (Q-learning) methods[1]. As mentioned previously, the Sarsa method evaluates the state and the action together to determine the value of the state. The update process is consistent with the current policy[30]. The Q-learning method, on the other hand, only evaluates the state value to determine the next action. This difference in how the methods update their Q-values affects the agent’s behaviour and which path the agent takes. The Q-learning method chooses the path closer to the cliff. The reason for this is that when deciding which action to take next, the Q-learning method is more optimistic in its value estimation and always assumes that the best action will be taken in the process[30]. The Sarsa method, however, is more conservative in estimating values, which causes the agent to take safer actions.

Figure (3.2) The paths for the Sarsa method and the Q-learning method


3.2 Reinforcement learning in a safety setting: side effects Sokoban

3.2.1 The side effects Sokoban

The side effects Sokoban is an environment developed by Leike et al. [27], who are part of the DeepMind AI Safety group. The environment is inspired by the Sokoban game. The goal in the Sokoban game is to push boxes around in a warehouse and try to get them to storage locations. The side effects Sokoban environment has only one box, which the agent needs to push in order to get to the goal.

3.2.2 The environment

The side effects Sokoban environment has an additional function, called the performance function R∗. Classical reinforcement learning frameworks have an objective function that focuses on maximizing the cumulative (visible) reward signal given by R. According to Leike et al. [27], this is an important feature but it does not capture everything that is relevant. Instead of only using the reward function, the performance function R∗ is also used. This function is not observed by the agent and might or might not be identical to R.

For the Sokoban side effects, the performance function is designed to capture both the agent’s objective and the safety of its behaviour. Hence, an agent achieving the objective safely would score higher on the performance function than an agent that achieves the objective unsafely.

The grid is a 10x10 matrix which consists of an agent, a box, the goal cell and walls as well as empty cells where the agent can move around. The agent can move right, left, up and down, depending on where the agent is located.

In Figure 3.3 the agent is marked as A, the pushable box is marked as X, and the goal is marked as G. This is an irreversible side effects environment[27] where the agent is rewarded for getting to the goal, G. It is irreversible in the sense that if the agent moves the box to a reversible position and then back to its original position, the agent itself cannot go back to its original position. The main goal here is to make the agent choose the longer path, such as moving the box to the right, rather than down into a corner. This action preserves the option of moving the box back to its original position. The agent gets a reward and a penalty depending on where the box is placed, according to the performance function. If the box is placed next to a contiguous wall, the agent is given a penalty of -5. If the box is placed in a corner, the agent receives a penalty of -10. If the agent chooses the short path, meaning that it places the box in the corner, it needs to take 5 steps from the start to reach the goal. However, if the agent places the box next to the contiguous wall, it needs to take 7 steps before reaching the goal.

The agent interacts with the environment in an episodic setting, which means that the environment is reset to its starting configuration at the start of each episode. If the agent enters the goal cell, the episode ends and the agent receives a reward of +50. There is also a default reward of -1 for each time step to encourage finishing the episode sooner rather than later, and no discounting is used in the environment. The maximum reward possible for the agent to obtain from the environment in each episode is 45 if it is trained without the wall penalty. If the agent is trained with the wall penalty, the maximum amount of reward possible for it to obtain is 35.
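The reward arithmetic for the short path can be written out as a small sketch. The constants are the values stated above; the function `episode_return` and the assumption that a box penalty is simply added to the undiscounted sum of rewards are illustrative.

```python
GOAL_REWARD = 50      # reward for entering the goal cell
STEP_REWARD = -1      # default reward per time step
CORNER_PENALTY = -10  # penalty when the box ends up in a corner


def episode_return(steps, box_penalty=0):
    """Undiscounted return of an episode that reaches the goal in `steps` moves.

    Assumption: the box penalty, if used, is added once to the return.
    """
    return GOAL_REWARD + steps * STEP_REWARD + box_penalty


short_path_reward = episode_return(5)
short_path_with_penalty = episode_return(5, CORNER_PENALTY)
```

Under these assumptions the short path yields 45 without a penalty and 35 when the corner penalty is applied.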

Figure (3.3) The game interface for the side effects Sokoban environment


3.3 Frameworks

One of the technologies that is relevant for this master’s thesis is OpenAI Gym, a toolkit that is used for research on reinforcement learning problems[31]. The toolkit is often used by researchers for standardizing and benchmarking results[32]. OpenAI Gym is a Python library that offers users a large number of test environments with shared interfaces for writing and testing general reinforcement learning algorithms[32].

Another relevant technology is GridWorld. GridWorld is a framework providing grid-based environments where one can apply reinforcement learning algorithms. This makes it possible to find optimal paths and policies for the agents on the grid to get to their desired goal grid cells in a minimum number of moves[33]. GridWorld is a 2D grid where the agents start off at one grid cell and try to move to another grid cell located in another area[33].

These frameworks show how the agent acts and learns in given environments. By using these technologies, one can change the algorithms and control the agent to get the desired outcome. These frameworks simulate simpler environments than the real world. OpenAI Gym and GridWorld can still be a good indication of the agent’s behaviour. By using these frameworks, it is also possible to compare the outcome with former works while observing possible improvements.


Chapter 4

Implementation

This chapter explains how the main experiments and algorithms were implemented. The main functions and classes that are required for conducting the experiments are presented. First, the implementation of the OpenAI Gym framework and how it interacts with the environment is explained. Later, the implementation of the two environments described in chapter 3 is explained. Finally, the implementation of the two training methods, Sarsa and Q-learning, is described.

4.1 OpenAI Gym

As mentioned earlier, the OpenAI Gym toolkit provides environments with a shared interface which can be used while working with reinforcement learning problems[34]. The environments consist of the methods reset() and step(). The agent chooses an action at each time-step, and the environment returns an observation and a reward. The learning process starts by calling reset(), which returns an initial observation[34]. The step() method returns four values: observation(object), reward(float), done(boolean) and info(dict).

The observation value holds the agent’s observation of the environment. The reward value is the amount of reward obtained from the previous action. The done value is True or False, with True indicating that the episode has terminated. The info value is a dictionary with diagnostic information that can be useful for debugging.
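The interaction loop described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: CorridorEnv and run_episode are hypothetical names, and the toy environment only mimics the reset() and step() interface without importing the Gym library.

```python
class CorridorEnv:
    """Hypothetical stand-in with a Gym-style reset()/step() interface."""

    def __init__(self, length=5):
        self.length = length  # the goal is the last cell of the corridor
        self.position = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the ends clamp the move.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 0.0 if done else -1.0
        info = {"position": self.position}  # diagnostic data only
        return self.position, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the summed reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Calling run_episode(CorridorEnv(), lambda observation: 1) runs a policy that always moves right until done becomes True.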


The OpenAI Gym environments also have two attributes, action_space and observation_space, both of type Space. These attributes describe the format of valid actions and observations.

4.2 The Cliff-walking environment

The Cliff-walking environment is implemented by using the open-source implementation made available by Denny Britz[35]. The environment is built on OpenAI Gym and uses the methods specified by the Gym environment interface.

One of the functions in the Cliff-walking environment specifies the coordinates of the task. This specification is later used by another function that calculates the transition probability for the next step based on the agent’s current position. The calculation of the transition probabilities depends on the action the agent takes, which can be up, down, right or left. This is the main function of the Cliff-walking environment, and the training methods depend on it to update the Q-values.
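A transition calculation of this kind could look roughly like the sketch below. The grid size, the start, goal and cliff cells and the reward values are assumptions taken from the common textbook formulation of cliff walking, not details confirmed in this chapter, and the function name transition is hypothetical.

```python
ROWS, COLS = 4, 12                      # assumed grid size
START, GOAL = (3, 0), (3, 11)           # assumed start and goal cells
CLIFF = {(3, c) for c in range(1, 11)}  # assumed cliff cells

# Row and column offsets for the four actions.
DELTAS = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}


def transition(position, action):
    """Return [(probability, next_position, reward, done)] for one move."""
    row, col = position
    d_row, d_col = DELTAS[action]
    # Moves that would leave the grid are clamped to the border.
    next_position = (
        min(max(row + d_row, 0), ROWS - 1),
        min(max(col + d_col, 0), COLS - 1),
    )
    in_cliff = next_position in CLIFF
    reward = -100.0 if in_cliff else -1.0   # assumed reward values
    done = in_cliff or next_position == GOAL
    # The sketch is deterministic, so each move has a single outcome.
    return [(1.0, next_position, reward, done)]
```

A training method can read the reward and the next position from this list when it updates the Q-value of the current position and action.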

4.3 The side effects Sokoban environment

The side effects Sokoban environment is implemented in the GridWorld framework and uses the pycolab library for visualization. Pycolab is a game engine for small reinforcement learning agents. The code is open source and published on GitHub by DeepMind[36]. The side effects Sokoban environment is more complex than the Cliff-walking environment and consists of several functions that are necessary to run the game properly.

The main class that builds the safety environment for the game is the SafetyEnvironment class. This class consists of functions that calculate the performance of the agent, the observation specification and the action specification. The SafetyEnvironment class is later used in another class, SideEffectsSokobanEnvironment, which builds a Python environment and returns a base Python interface for the side effects Sokoban game.

The SideEffectsSokobanEnvironment class is used as an object in the training methods, where the agent in the environment is trained to behave in the desired way. The AgentSprite class is also an important feature for running the environment. This class consists of a function that updates the received reward from each action taken and checks if the agent has reached the final goal.

4.3.1 The OpenAI Gym wrapper

The side effects Sokoban environment is developed using the GridWorld framework. This framework is incompatible with the Q-learning and Sarsa methods, which were implemented against the OpenAI Gym framework. To overcome this problem, an OpenAI Gym wrapper has been used. The wrapper was developed by David Lindner and retrieved from GitHub[37]. It was developed specifically for the environments described in the AI Safety GridWorlds. The main function of the wrapper is to replace the observation specification and the action specification of the side effects Sokoban environment, which follow the GridWorld framework, with the observation space and action space as they are defined in the OpenAI Gym framework.
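The idea behind such a wrapper can be illustrated with a small adapter. This sketch does not reproduce the wrapper in [37]: SpecStyleEnv and GymStyleWrapper are hypothetical classes, the dictionary-based specifications are simplified stand-ins for the real specification objects, and the spaces are plain Python values rather than Gym Space instances.

```python
class SpecStyleEnv:
    """Hypothetical stand-in for an environment that exposes specifications."""

    def action_spec(self):
        return {"minimum": 0, "maximum": 3}   # four discrete actions

    def observation_spec(self):
        return {"shape": (6, 6)}              # board height and width

    def reset(self):
        board = [[0] * 6 for _ in range(6)]
        return {"observation": board, "reward": None, "last": False}

    def step(self, action):
        board = [[0] * 6 for _ in range(6)]
        # In this toy stand-in, action 3 ends the episode.
        return {"observation": board, "reward": -1.0, "last": action == 3}


class GymStyleWrapper:
    """Adapts a specification-style environment to reset() and step()."""

    def __init__(self, env):
        self._env = env
        spec = env.action_spec()
        # Valid actions, playing the role of action_space.
        self.action_space = range(spec["minimum"], spec["maximum"] + 1)
        # Board shape, playing the role of observation_space.
        self.observation_space = env.observation_spec()["shape"]

    def reset(self):
        return self._env.reset()["observation"]

    def step(self, action):
        if action not in self.action_space:
            raise ValueError("invalid action: %r" % (action,))
        timestep = self._env.step(action)
        reward = 0.0 if timestep["reward"] is None else timestep["reward"]
        # Return the four values expected from a Gym-style step().
        return timestep["observation"], reward, timestep["last"], {}
```

With an adapter of this shape, a training function written for the Gym interface only needs reset(), step() and the two space attributes, regardless of how the wrapped environment describes its actions and observations.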

4.4 The Q-learning and the Sarsa function

The Q-learning and Sarsa algorithms are provided as open-source code on GitHub[35]. The code takes an environment object, env, which is an instance of an environment class such as the side effects Sokoban environment or the Cliff-walking environment.

The environment object is then used as a parameter in either the Q-learning function or the Sarsa function. This way the training function can use the different environments and run the environment for as many episodes as the user wants. The Q-learning and the Sarsa function take the environment, the number of episodes, the discount factor, the learning rate and the ε of the ε-greedy policy as parameters and return Q. Q is the learned estimate of the optimal action-value function and is a nested dictionary that maps states to action values.
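The two training functions differ only in the value they bootstrap from, which the following sketch tries to make visible. It is a simplified illustration under stated assumptions and not the code from [35]: the function name train, the method flag, the seed parameter and the toy Corridor environment are hypothetical, and Q maps each state to a list of action values.

```python
import random
from collections import defaultdict


class Corridor:
    """Hypothetical toy task: walk right from cell 0 to the goal in cell 3."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # 0 = left, 1 = right
        self.pos = min(max(self.pos + (1 if action == 1 else -1), 0), 3)
        done = self.pos == 3
        return self.pos, (0.0 if done else -1.0), done, {}


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    # Explore with probability epsilon, otherwise take the greedy action.
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])


def train(env, n_actions, num_episodes, discount_factor=1.0, alpha=0.5,
          epsilon=0.1, method="q_learning", seed=0):
    """Tabular Q-learning or Sarsa on a reset()/step() environment."""
    rng = random.Random(seed)
    # Mapping from state to a list of action values, created on demand.
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            if method == "sarsa":
                # On-policy target: value of the action actually chosen next.
                next_action = epsilon_greedy(Q, next_state, n_actions,
                                             epsilon, rng)
                next_value = Q[next_state][next_action]
            else:
                # Off-policy target: value of the best next action.
                next_value = max(Q[next_state])
            target = reward + (0.0 if done else discount_factor * next_value)
            Q[state][action] += alpha * (target - Q[state][action])
            if method != "sarsa":
                # Q-learning picks its next action after the update.
                next_action = epsilon_greedy(Q, next_state, n_actions,
                                             epsilon, rng)
            state, action = next_state, next_action
    return Q
```

For example, train(Corridor(), 2, 200) returns a table in which moving right has the higher value in every non-terminal cell of the toy corridor.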


4.4.1 Modification of the Sarsa function

The Sarsa function was modified to add the wall penalty to the episodic rewards and to update the learning policy. The episodic reward is the total reward the agent has obtained after each episode and is given by the reward signal that is observed by the agent. The modification of the Sarsa function was accomplished by tracking the box in the grid. If the agent places the box in a corner, the wall penalty is set to -10. If the agent places the box next to a contiguous wall, the wall penalty is set to -5. The wall penalty is hidden from the agent and does not affect the reward the agent receives from the environment. This is the same as the performance function, R*, which is not observed by the agent. The purpose of this modification is to train the agent to avoid side effects, which in this case means placing the box in a reversible position and preserving the option to move the box back to its original place. The total return in episodic rewards will show whether the agent has learned the desired behaviour. This function is, therefore, defined as a performance metric that measures how well the agent avoids side effects.
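The penalty rule can be sketched as a function of the box position. The values -10 and -5 come from the description above; everything else is an assumption made for illustration: the function name wall_penalty, the encoding of the grid as a list of strings with '#' for walls, the requirement that the grid is fully enclosed by walls, and the reading of a contiguous wall as one that spans a whole row or column.

```python
CORNER_PENALTY = -10  # box placed in a corner
WALL_PENALTY = -5     # box placed next to a contiguous wall


def wall_penalty(grid, box, original_box):
    """Hidden penalty for the box position; grid rows use '#' for walls.

    Assumes the grid is enclosed by walls, so the box is never on the border.
    """
    if box == original_box:
        return 0
    row, col = box
    # Wall flags for the four neighbouring cells.
    north = grid[row - 1][col] == "#"
    east = grid[row][col + 1] == "#"
    south = grid[row + 1][col] == "#"
    west = grid[row][col - 1] == "#"
    # A corner has a vertical and a horizontal neighbouring wall at once.
    if (north or south) and (east or west):
        return CORNER_PENALTY
    # With exactly one neighbouring wall, check that it spans the whole grid.
    if [north, east, south, west].count(True) == 1:
        if north or south:
            wall_row = row - 1 if north else row + 1
            contiguous = all(cell == "#" for cell in grid[wall_row])
        else:
            wall_col = col + 1 if east else col - 1
            contiguous = all(line[wall_col] == "#" for line in grid)
        if contiguous:
            return WALL_PENALTY
    return 0
```

In a training loop, the value returned for the final box position would be added to the episodic reward that is logged for evaluation, while the reward passed to the learning update stays unchanged.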
