Why Imitate, and If So, How?
Karl H. Schlag
Manuel Milling, Severin Wünsch
Institut für Physik der Universität Augsburg
Motivation The Model
Imitation Behavioral Rules
Improving Rules
Drawback of Switch If Better Proportional Imitation Rule Selecting among Improving Rules Population Dynamics
Non-Stationary Multi-Armed Bandit Conclusion
Motivation
Motivation
I Multi-armed bandit as a common problem in game theory
I Individuals in a population choose different actions
I Different actions lead to different (randomly) distributed payoffs
I Goal: Find strategy to maximize payoff without global knowledge
The Model
I Multi-armed bandit:
I Range of payoffs [α;ω]
I ≥2 arms
I stationary probability distributions of payoffs
I Population:
I ≥2 individuals
I Each individual chooses an action (arm) each round depending on its behavior rules
I New individuals can replace old ones (with the same knowledge)
Figure: Slot machines in Las Vegas
The Model
Imititation
I What means imitation?
I An individual memorizes only its last action and payoff
I An individual also samples the last action and payoff of another randomly chosen individual
I Animitating behaviormeans to either switch to the action of the other individual or to stay with the own action (no random action possible)
I This choice is based on the own and the sampled payoff
Formalization of Behavioral Rules (1)
I A Multi-armed bandit consists of a set of armsA, each arm has a payoff distribution bounded by [α;ω]
I An individual knows its last actioni with payoffx and the last action j with payoffy of the sampled individual
I The individual chooses actionk for the next step with a certain Probability:
F(i,x,j,y)k →∆(A)
I According to imitation k ∈ {i,j}
⇒F(i,x,j,y)i+F(i,x,j,y)j = 1
Behavioral Rules
Examples for Behavior Rules
I Never Switch
F(i,x,j,y)j = 0
I Always Switch
F(i,x,j,y)j = 1
I Switch if Better
F(i,x,j,y)j =
( 1, x <y 0, x ≥y
I Proportional Imitation Rule F(i,x,j,y)j =
( σ(y−x), x<y
0, x≥y
σ: constant
Formalization of Behavioral Rules (2)
I Probability that individuali (sampling j) chooses actionk without knowing the payoffs:
Fijk =X
x,y
F(i,x,j,y)kPi(x)Pj(y)
I Pi(x): Probability of retrieving payoffx with action i
I Pj(y): Analogue
Behavioral Rules
Objective for behavioral rule
I Goal: Achieve best average payoff
I Performance of behavioral rule depends on multi-armed bandit
I Assuming a population W of N individuals in a state s ∈AN at a time t, the expected payoff of an individual with behavioral ruleF at time t+ 1 equals to
EF,stπ0 = 1 N
X
c∈W
X
d∈W\{c}
Pr(c d)X
r∈A
Fsrt(c)st(d)πr
Pr(c d): Probability that individualc samples individual d πr: Expected payoff of actionr
Improving Rules
I A well performing behavioral rule should perform well on every multi-armed bandit in every state
I Reference rule: Never Switch with average payoff ¯π(st)
I Definition: A behavioral rule is improvingif the expected improvement
EIPF(st) :=EF,stπ0−π(s¯ t)
is greater or equal to 0 for any state s and multi-armed bandit.
⇒ An improving rule works at least as good as ’Never Switch’ in any scenario
Degenerated Improving Rules
I E.g. Never Switch and Always Switch are improving
I Both will not increase expected payoff
→ Degenerated improving rules
Behavioral Rules
Lemma about improving rules
I Lemma: Let F be a behavioral rule. Then F is improving if and only if F is imitating and for any i,j ∈A,i 6=j
(Fijj −Fjii)(πj −πi)≥0
I Proof:
I Imitation is necessary due to existing information. Proof by contradiction with appropriate multi-armed bandit.
I if- and only-if-Statement EIPF(s) = 1
N X
c∈W
X
d∈W\{c}
Pr(c d)Fs(c)s(d)s(d) [πs(d)−πs(c)]
with a symmetric sampling probability
EIPF(s) = 1 N
X
i<j
X
c:s(c)=i d:s(d)=j
Pr(c d)
(Fijj −Fjii)(πj−πi)
Drawback of Switch If Better
I Switch if better is not an improving rule for every multi-armed bandit
I It can not distinguish between lucky and certain payoffs
I Counterexample:
I x∈(α,(α+ω)/2), P1(x) = 1, P2(α) =λ, P2(ω) = 1−λ, 0< λ <1
I π2> π1if and only if λ < ω−αω−x
I F122 = 1−λandF211 =λ
I F211 >F122 if and only if λ > 12
I when 12< λ <(ω−x)/(ω−α)→(F122 −F211)(π2−π1)<0
Behavioral Rules
Theorem 1
I Theorem: The behavioral rule is improving if and only if
I F is imitating and
I For all i,j∈A,i6=j, there existsσij =σji∈[0,1/(ω−α)] such that F(i,x,j,y)−F(j,y,i,x) =σij(y−x) for all x,y∈[α, ω]
I Time-intensive proof but,
I Fulfills first Lemma
F(i,x,j,y)j−F(j,y,i,x)i=σij(y−x), σij≥0 P
x,yPi(x)Pj(y)
=⇒ Fijj−Fjii =σij(πj−πi)≥0
I Improving rules like Never Switch, Always Switch fulfill this Lemma
I Non-improving rule Switch if better doesn’t fulfill this Lemma
Corollary about Degenerate improving Rules
I Corollary: An improving rule is degenerate if and only if σij = 0 for all i,j ∈A,i 6=j
I Never Switch and always Switch are degenerated improving rules
I ⇒ non degenerated improving rules induce stochastic behaviour
I σ≤1/(ω−α) implies that multi-armed bandits with unbounded payoffs only have degenerated improving rules.
Behavioral Rules
Selecting among Improving Rules
I Adominant rule always generates the best expected improvement among all improving rules
I Dominant rules maximize the improvement of average expected payoff in a monomorphic population
EIPF(s) =
1 N
X
i<j
X
c:s(c)=i d:s(d)=j
Pr(c d)
σij(πj −πi)2
I Proposition: A behavioral rule is a dominant rule if and only if it is improving and for any i 6=j,σij = 1/(ω−α)
Theorem 2
Fp is the unique dominant rule that satisfies any one of the following properties:
I (i) It never prescribes to imitate an action that achieved a lower payoff
I (ii) In each multi-armed bandit and state it minimizes the probability of switching among dominant rules
I (iii) For each multi-armed bandit and current state it minimizes the variance of the average payoff in a monomorphic population in the next round among dominant rules
Proof can be derived from theorem 1
⇒ Fp maximizes the probability that average payoffs realized in monomorphic population increase over time
Population Dynamics
Theorem 3
I Large population ofN 1 individuals following rule F (monomorphic).
I Random and independent sampling
I Theorem 3:
For eachδ >0, >0and T ∈Nthere exists N0 ∈N such that for any population size N>N0 and any initial statep˜∈∆N(A), the event {kpN(T)−p(T)k> δ} occurs with a probability less than where pN(1) =p(1) = ˜p and (p(t))t∈N satisfies
pi(t+ 1) =X
j,r
pj(t)pr(t)Fjri t ∈N,
wherepi(t) denotes the ratio of individuals performing action i in round t to the total number of individualsN.
Theorem 3 (2)
I Proof follows induction overT
I Theorem 3 holds for short-run dynamics (smallT) of a population with a fix numberN of individuals.
I Long run behavior can differ dramatically from Theorem 3
I Remark: In a finite population there is a positive probability that all individuals will be playing the worst action after a finite number of rounds.
Population Dynamics
Theorem 3 for infinite populations
I infinite populations will behave deterministic:
pi(t+ 1) =pi(t) +pi(t)X
j∈A
σijpj(t)(πi −πj)
I In the long run, a non degenerated rule (σij >0) will lead each individual of an infinite population to choose the action with the highest expected payoff.
I Using the optimal rule leads to
pi(t+ 1) =pi(t) +pi(t) 1
ω−α(πi −π(p(t)))¯ with ¯π(p(t)) =Piπipi
I It can be shown that for a sufficient large population using the optimal rule, the probability is extremely high that all individuals will choose actions with a maximal payoff in the long run
Non-Stationary Multi-Armed Bandit
I Changes in payoff distributions
I Example:
I Individuals from two different groups are repeatedly randomly matched to play a one shot game
I Changes in payoff distribution result from different opponents
I Sampling occurs within each population
I Switching now depends on the population state
I Payoffs are limited for each groupi∈1,2 by [αi, ωi] Optimal Behavior
I An individual acts as they were facing a stationary bandit with optimal switching rate 1/(ωi−αi)
Conclusion
Conclusion
I Non degenerated improving rules increase the expected payoff for an individual for every restricted multi-armed bandit and population state
I The unique dominant rule performing better than every other improving rule is the Proportional Improving Rule
F(i,x,j,y)−F(j,y,i,x) =σij(y−x) withσij = 1/(ω−α)
I If two individuals are sampled, dominant rules do not exist
I Using the unique dominant rule most certainly leads all individuals of a large population to choose an action with the highest expected payoff in the long run
I For unlimited knowledge Imitate if Better is the unique dominant rule