Why Imitate, and If So, How? Karl H. Schlag

(1)

Why Imitate, and If So, How?

Karl H. Schlag

Manuel Milling, Severin Wünsch

Institut für Physik der Universität Augsburg

(2)

Motivation The Model

Imitation Behavioral Rules

Improving Rules

Drawback of Switch If Better Proportional Imitation Rule Selecting among Improving Rules Population Dynamics

Non-Stationary Multi-Armed Bandit Conclusion

(3)

Motivation

I Multi-armed bandit as a common problem in game theory

I Individuals in a population choose different actions

I Different actions lead to different (randomly) distributed payoffs

I Goal: Find strategy to maximize payoff without global knowledge

(4)

The Model

I Multi-armed bandit:

I Range of payoffs [α;ω]

I ≥2 arms

I stationary probability distributions of payoffs

I Population:

I ≥2 individuals

I Each individual chooses an action (arm) each round depending on its behavior rules

I New individuals can replace old ones (with the same knowledge)

Figure: Slot machines in Las Vegas

(5)

The Model

Imititation

I What means imitation?

I An individual memorizes only its last action and payoff

I An individual also samples the last action and payoff of another randomly chosen individual

I Animitating behaviormeans to either switch to the action of the other individual or to stay with the own action (no random action possible)

I This choice is based on the own and the sampled payoff

(6)

Formalization of Behavioral Rules (1)

I A Multi-armed bandit consists of a set of armsA, each arm has a payoff distribution bounded by [α;ω]

I An individual knows its last actioni with payoffx and the last action j with payoffy of the sampled individual

I The individual chooses actionk for the next step with a certain Probability:

F(i,x,j,y)k →∆(A)

I According to imitation k ∈ {i,j}

⇒F(i,x,j,y)_i+F(i,x,j,y)_j = 1

(7)

Behavioral Rules

Examples for Behavior Rules

I Never Switch

F(i,x,j,y)_j = 0

I Always Switch

F(i,x,j,y)j = 1

I Switch if Better

F(i,x,j,y)_j =

( 1, x <y 0, x ≥y

I Proportional Imitation Rule F(i,x,j,y)j =

( σ(y−x), x<y

0, x≥y

σ: constant

(8)

Formalization of Behavioral Rules (2)

I Probability that individuali (sampling j) chooses actionk without knowing the payoffs:

F_ij^k =^X

x,y

F(i,x,j,y)kPi(x)Pj(y)

I Pi(x): Probability of retrieving payoffx with action i

I P_j(y): Analogue

(9)

Behavioral Rules

Objective for behavioral rule

I Goal: Achieve best average payoff

I Performance of behavioral rule depends on multi-armed bandit

I Assuming a population W of N individuals in a state s ∈A^N at a time t, the expected payoff of an individual with behavioral ruleF at time t+ 1 equals to

EF,s^tπ⁰ = 1 N

X

c∈W

X

d∈W\{c}

Pr(c d)^X

r∈A

F_s^rt(c)s^t(d)πr

Pr(c d): Probability that individualc samples individual d πr: Expected payoff of actionr

(10)

Improving Rules

I A well performing behavioral rule should perform well on every multi-armed bandit in every state

I Reference rule: Never Switch with average payoff ¯π(s^t)

I Definition: A behavioral rule is improvingif the expected improvement

EIP_F(s^t) :=E_F_,s^tπ⁰−π(s¯ ^t)

is greater or equal to 0 for any state s and multi-armed bandit.

⇒ An improving rule works at least as good as ’Never Switch’ in any scenario

Degenerated Improving Rules

I E.g. Never Switch and Always Switch are improving

I Both will not increase expected payoff

→ Degenerated improving rules

(11)

Behavioral Rules

Lemma about improving rules

I Lemma: Let F be a behavioral rule. Then F is improving if and only if F is imitating and for any i,j ∈A,i 6=j

(F_ij^j −F_jiⁱ)(πj −πi)≥0

I Proof:

I Imitation is necessary due to existing information. Proof by contradiction with appropriate multi-armed bandit.

I if- and only-if-Statement EIP_F(s) = 1

N X

c∈W

X

d∈W\{c}

Pr(c d)F_s(c)s(d)^s(d) [π_s(d)−π_s(c)]

with a symmetric sampling probability

EIPF(s) = 1 N

X

i<j





 X

c:s(c)=i d:s(d)=j

Pr(c d)







(F_ij^j −F_jiⁱ)(πj−πi)

(12)

Drawback of Switch If Better

I Switch if better is not an improving rule for every multi-armed bandit

I It can not distinguish between lucky and certain payoffs

I Counterexample:

I x∈(α,(α+ω)/2), P1(x) = 1, P2(α) =λ, P2(ω) = 1−λ, 0< λ <1

I π2> π1if and only if λ < _ω−α^ω−x

I F₁₂² = 1−λandF₂₁¹ =λ

I F₂₁¹ >F₁₂² if and only if λ > ¹₂

I when ¹₂< λ <(ω−x)/(ω−α)→(F₁₂² −F₂₁¹)(π2−π1)<0

(13)

Behavioral Rules

Theorem 1

I Theorem: The behavioral rule is improving if and only if

I F is imitating and

I For all i,j∈A,i6=j, there existsσij =σji∈[0,1/(ω−α)] such that F(i,x,j,y)−F(j,y,i,x) =σij(y−x) for all x,y∈[α, ω]

I Time-intensive proof but,

I Fulfills first Lemma

F(i,x,j,y)_j−F(j,y,i,x)_i=σ_ij(y−x), σ_ij≥0 P

x,yPi(x)Pj(y)

=⇒ F_ij^j−F_jiⁱ =σij(πj−πi)≥0

I Improving rules like Never Switch, Always Switch fulfill this Lemma

I Non-improving rule Switch if better doesn’t fulfill this Lemma

(14)

Corollary about Degenerate improving Rules

I Corollary: An improving rule is degenerate if and only if σ_ij = 0 for all i,j ∈A,i 6=j

I Never Switch and always Switch are degenerated improving rules

I ⇒ non degenerated improving rules induce stochastic behaviour

I σ≤1/(ω−α) implies that multi-armed bandits with unbounded payoffs only have degenerated improving rules.

(15)

Behavioral Rules

Selecting among Improving Rules

I Adominant rule always generates the best expected improvement among all improving rules

I Dominant rules maximize the improvement of average expected payoff in a monomorphic population

EIP_F(s) =





 1 N

X

i<j

X

c:s(c)=i d:s(d)=j

Pr(c d)







σ_ij(π_j −π_i)²

I Proposition: A behavioral rule is a dominant rule if and only if it is improving and for any i 6=j,σij = 1/(ω−α)

(16)

Theorem 2

F^p is the unique dominant rule that satisfies any one of the following properties:

I (i) It never prescribes to imitate an action that achieved a lower payoff

I (ii) In each multi-armed bandit and state it minimizes the probability of switching among dominant rules

I (iii) For each multi-armed bandit and current state it minimizes the variance of the average payoff in a monomorphic population in the next round among dominant rules

Proof can be derived from theorem 1

⇒ F^p maximizes the probability that average payoffs realized in monomorphic population increase over time

(17)

Population Dynamics

Theorem 3

I Large population ofN 1 individuals following rule F (monomorphic).

I Random and independent sampling

I Theorem 3:

For eachδ >0, >0and T ∈Nthere exists N0 ∈N such that for any population size N>N0 and any initial statep˜∈∆^N(A), the event {kp^N(T)−p(T)k> δ} occurs with a probability less than where p^N(1) =p(1) = ˜p and (p(t))t∈N satisfies

p_i(t+ 1) =^X

j,r

p_j(t)p_r(t)F_jrⁱ t ∈N,

wherepi(t) denotes the ratio of individuals performing action i in round t to the total number of individualsN.

(18)

Theorem 3 (2)

I Proof follows induction overT

I Theorem 3 holds for short-run dynamics (smallT) of a population with a fix numberN of individuals.

I Long run behavior can differ dramatically from Theorem 3

I Remark: In a finite population there is a positive probability that all individuals will be playing the worst action after a finite number of rounds.

(19)

Population Dynamics

Theorem 3 for infinite populations

I infinite populations will behave deterministic:

pi(t+ 1) =pi(t) +pi(t)^X

j∈A

σijpj(t)(πi −πj)

I In the long run, a non degenerated rule (σ_ij >0) will lead each individual of an infinite population to choose the action with the highest expected payoff.

I Using the optimal rule leads to

p_i(t+ 1) =p_i(t) +p_i(t) 1

ω−α(π_i −π(p(t)))¯ with ¯π(p(t)) =^P_iπ_ip_i

I It can be shown that for a sufficient large population using the optimal rule, the probability is extremely high that all individuals will choose actions with a maximal payoff in the long run

(20)

Non-Stationary Multi-Armed Bandit

I Changes in payoff distributions

I Example:

I Individuals from two different groups are repeatedly randomly matched to play a one shot game

I Changes in payoff distribution result from different opponents

I Sampling occurs within each population

I Switching now depends on the population state

I Payoffs are limited for each groupi∈1,2 by [αi, ωi] Optimal Behavior

I An individual acts as they were facing a stationary bandit with optimal switching rate 1/(ω_i−α_i)

(21)

Conclusion

I Non degenerated improving rules increase the expected payoff for an individual for every restricted multi-armed bandit and population state

I The unique dominant rule performing better than every other improving rule is the Proportional Improving Rule

F(i,x,j,y)−F(j,y,i,x) =σij(y−x) withσij = 1/(ω−α)

I If two individuals are sampled, dominant rules do not exist

I Using the unique dominant rule most certainly leads all individuals of a large population to choose an action with the highest expected payoff in the long run

I For unlimited knowledge Imitate if Better is the unique dominant rule