
2.1 Reinforcement Learning

Figure 3: Reinforcement learning diagram.

Reinforcement learning allows machine performance to improve over time. It is an approach wherein intelligent programs, called agents, perform actions in a known or unknown environment [7].

The agent and the environment are the two major components of reinforcement learning. The agent interacts with the environment by performing actions and gathering observations. Its aim is to solve a problem and carry out the solution in the most efficient way. The environment, on the other hand, is everything outside the agent. It provides observations and feedback to the agent, and it can be a 2-dimensional or 3-dimensional world, or a game-based scenario [8].

The environment provides either positive or negative feedback to the agent depending on the action it performed. Through continuous interaction with the environment, the agent adapts and learns based on the feedback it has received [7].

The other components of reinforcement learning are actions, rewards and observations; these are illustrated in Figure 3. Actions, which can be either discrete or continuous, are the things the agent can execute in the environment, while rewards and observations are the communication channels between the agent and the environment.

A reward is the way the environment provides feedback to the agent about the success of its latest activity. It is a scalar value that can be negative or positive. The agent obtains rewards whenever it interacts with the environment, or they can be given by the environment at specific time steps. The goal of the agent is to accumulate the largest total reward through the series of actions it executes; this is what drives the learning process. The second communication channel for the agent is observations, through which the environment provides the agent with information about what is happening around it [8].
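To make the action-observation-reward cycle concrete, the following minimal sketch shows an agent acting in a toy environment. The environment class, its states and its reward values are hypothetical and serve only to illustrate the interaction loop described above.

```python
import random

class ToyEnvironment:
    """A hypothetical 1-dimensional grid world used only for illustration."""

    def __init__(self, size=5):
        self.size = size
        self.position = 0  # the observation is simply the agent's position

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action: -1 (move left) or +1 (move right)
        self.position = max(0, min(self.size - 1, self.position + action))
        done = self.position == self.size - 1
        reward = 1.0 if done else -0.1  # positive feedback at the goal, small penalty otherwise
        return self.position, reward, done

env = ToyEnvironment()
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    action = random.choice([-1, 1])               # a random policy, for illustration only
    observation, reward, done = env.step(action)  # environment returns observation and reward
    total_reward += reward
print("accumulated reward:", total_reward)
```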

The following sections discuss the concepts that form the theoretical foundation of reinforcement learning. They start with the simplest one, the Markov Process, which, when expanded to include rewards, turns into a Markov Reward Process. Another layer of complexity is added by including actions in the Markov Reward Process, transforming it into a Markov Decision Process.

2.1.1 Markov Process

The Markov Process (MP), also known as the Markov chain, is a system that conforms to the Markov property. Any observation made of the system is referred to as a state, while the state space is the set of all possible states of the system. In an MP, the state space needs to be finite. Over time, a sequence of observations forms a chain of states, which is referred to as the history.

The Markov property states that, from any observable state, the future dynamics of the system depend only on that state. As such, the property requires unique and distinguishable states. By fulfilling the Markov property, the future dynamics of the system can be modelled from a single state, without requiring the whole history.

A transition matrix is used to summarize the transition probabilities between states in a square matrix. The size of the matrix is N × N, where N denotes the number of states. The cell in row i and column j holds the probability of the system moving from state i to state j [8].
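As a small illustration (the states and probabilities below are made up), a transition matrix can be stored as a NumPy array whose rows each sum to one, and a chain of states can be sampled from it:

```python
import numpy as np

# Hypothetical 3-state chain: 0 = sunny, 1 = cloudy, 2 = rainy.
# Row i holds the probabilities of moving from state i to each state j.
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.40, 0.30],
    [0.2, 0.40, 0.40],
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution

rng = np.random.default_rng(0)
state = 0
history = [state]
for _ in range(10):
    state = rng.choice(3, p=P[state])  # sample the next state from row `state`
    history.append(int(state))
print(history)
```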

2.1.2 Markov Reward Process

The MP model is expanded by adding value to the transitions from one state to another. In this way, rewards are introduced, and the simple MP becomes a Markov Reward Process (MRP).

In particular, two components are added to the model, namely the reward and the discount factor. A reward is just a number, which can be large or small, positive or negative. It can take on different types of representation; however, the most common way is to present it as a matrix like the transition matrix, where row i and column j contains the reward for transitioning from state i to state j [8].

On the other hand, the discount factor γ (gamma) is a single number that depicts the agent's foresightedness. Its value can range from 0 to 1. To understand its purpose, the return value at time t for every episode is examined. The return is calculated using the formula [8]:

G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}. \qquad (2.1)

Equation 2.1 computes the return for every time step as the sum of subsequent, or future, rewards. However, rewards that are k time steps away from the starting point t are multiplied by the discount factor γ raised to the power k. Inspecting the role of γ, it can be seen that if γ = 1, the return value G_t is simply the sum of all future rewards; this means the agent can perfectly see any future reward. Conversely, if γ = 0, the return value G_t is just the immediate reward, without any consideration for future rewards; the agent is completely short-sighted. In practical applications, the discount factor is commonly set between 0 and 1, for example 0.9 or 0.99. The discount factor can be thought of as a measure of how far the agent looks into the future when estimating the return: as γ gets closer to 1, more of the future steps are taken into consideration [8].
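The sketch below evaluates Equation 2.1 for a finite episode of rewards; the reward sequence and the values of γ are arbitrary choices that merely show how the discount factor changes the return.

```python
def discounted_return(rewards, gamma):
    """Return G_t for t = 0, i.e. sum over k of gamma**k * rewards[k] (Equation 2.1)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

episode_rewards = [-0.1, -0.1, -0.1, 1.0]  # arbitrary example episode

print(discounted_return(episode_rewards, gamma=1.0))  # plain sum of all rewards: 0.7
print(discounted_return(episode_rewards, gamma=0.0))  # only the immediate reward: -0.1
print(discounted_return(episode_rewards, gamma=0.9))  # intermediate foresight: about 0.458
```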

The return value is not very practical because it is defined for every chain observed from the MRP; as such, it can vary considerably even for the same state. A much more practical quantity is the value of state, defined as the mathematical expectation of the return for any state [8], where:

V(s) = E[G \mid S_t = s]. \qquad (2.2)

Equation 2.2 shows that the value of state V(s) for every state s is the expected, or average, return acquired by going through the MRP.
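One way to make Equation 2.2 concrete is to estimate V(s) by averaging sampled returns, as in the sketch below; the chain, the rewards and the value of γ are the same hypothetical quantities used earlier, and the returns are truncated after a fixed number of steps.

```python
import numpy as np

# Hypothetical MRP: transition matrix P, transition rewards R, discount gamma.
P = np.array([[0.8, 0.15, 0.05],
              [0.3, 0.40, 0.30],
              [0.2, 0.40, 0.40]])
R = np.array([[ 1.0,  0.0, -1.0],
              [ 0.5,  0.0, -0.5],
              [ 0.0, -0.5, -1.0]])
gamma = 0.9
rng = np.random.default_rng(2)

def sample_return(start_state, steps=60):
    """Sample one (truncated) discounted return G starting from start_state."""
    state, g = start_state, 0.0
    for k in range(steps):
        next_state = rng.choice(3, p=P[state])
        g += gamma**k * R[state, next_state]
        state = next_state
    return g

# Monte Carlo estimate of V(s): average the return over many sampled chains.
for s in range(3):
    estimate = np.mean([sample_return(s) for _ in range(1000)])
    print(f"V({s}) is approximately {estimate:.3f}")
```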

2.1.3 Markov Decision Process

In order to transform the MRP into a Markov Decision Process (MDP), actions are added to the model. The first consideration is to have a finite set of actions, also referred to as the agent's action space.

An extra dimension is required for the transition matrix in order to include actions. The agent is no longer an uninvolved observer of the state transitions, but now has the power to choose which action to take at every time step [8].

Figure 4: Illustration of the MDP transition matrix from Lapan, Deep Reinforcement Learning Hands-On.

To better visualize the addition of the action dimension, Figure 4 shows a 3-dimensional transition matrix. The depth dimension encompasses the possible actions (k) the agent can choose to take, the height dimension is the source state (i), and the width is the target state (j). When the agent chooses an action, the probabilities over the target states can change. By having a 3D transition matrix, the MDP can cover all the intricacies of the environment and its range of possible feedback to the agent's actions.
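As a hypothetical illustration of such a 3-dimensional transition matrix, the tensor below is indexed as P[action, source, target], so that each action selects its own 2-dimensional transition matrix; the probabilities are invented for the example.

```python
import numpy as np

# Hypothetical 2-action, 3-state MDP: P[k, i, j] is the probability of
# landing in state j when action k is taken in state i.
P = np.array([
    [[0.9, 0.1, 0.0],    # action 0 ("stay"): mostly remain near the current state
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.1, 0.6, 0.3],    # action 1 ("move"): push the system toward higher states
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
assert np.allclose(P.sum(axis=2), 1.0)  # each (action, source) row is a distribution

state, action = 0, 1
print("next-state distribution:", P[action, state])
```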

Furthermore, to completely turn the MRP into an MDP, the reward matrix is extended with the action dimension as well, just like the transition matrix. As such, the attainable reward depends not only on the state the agent ends up in, but also on the action it chose to get there [8].

Another main concept for the MDP and reinforcement learning is the policy. It is defined as the set of rules that determine how the agent acts in the environment, and it determines the amount of return obtained by the agent. It is therefore vital that a good policy is found, since it ensures that the agent's goal of accumulating the largest return is achieved.

The formal definition of policy is as follows:

\pi(a \mid s) = P[A_t = a \mid S_t = s], \qquad (2.3)

which is the probability distribution over actions given every possible state [8].

If the policy is constant, the MDP reduces to an MRP, and the transition and reward matrices no longer need the action dimension.
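As a sketch of Equation 2.3 and of this reduction, a tabular policy π(a|s) can be stored as a matrix of action probabilities per state; averaging the action-indexed transition tensor under a fixed policy collapses it back into an ordinary MRP transition matrix. All numbers are hypothetical.

```python
import numpy as np

# Hypothetical 2-action, 3-state MDP transition tensor P[k, i, j] (as above).
P = np.array([
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])

# Tabular policy: pi[s, a] = probability of taking action a in state s (Equation 2.3).
pi = np.array([
    [0.5, 0.5],
    [0.2, 0.8],
    [1.0, 0.0],   # a fixed (deterministic) choice in state 2
])

# Under a fixed policy the MDP collapses to an MRP: weight the
# action-indexed matrices by the policy's action probabilities.
P_mrp = np.einsum('sa,asj->sj', pi, P)
assert np.allclose(P_mrp.sum(axis=1), 1.0)  # still a valid transition matrix
print(P_mrp)
```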