Information theoretic learning for pattern classification

(1)

Master's Thesis in Eletrial Engineering

Information Theoreti Learning for

Pattern Classiation

by

Ola Kvisle Storås

Deember, 2007

FACULTY OF SCIENCE

Department of Physis and Tehnology

University of Tromsø

(2)

(3)

Thisthesisisastudyofpatternlassiationbasedoninformationtheoretiriteria. Infor-

mationtheoreti riteriaareimportantmeasures basedonentropy anddivergene between

data distributions. First, the basi onepts of pattern lassiation with the well known

Bayes lassiation rule as a startingpointis disussed. We disuss how the Parzen win-

dow estimator may be used to nd good density estimates. The Parzen window density

estimator an be used to estimate ost funtions based on information theoreti riteria.

Furthermore,we explain amodelof aninformationtheoreti learningmahine. Withost

funtionsbased oninformationtheoreti riteria,weargue thatalearningmahinepoten-

tiallyapturesmuhmore informationaboutadata setthan thetraditionalmeansquared

error ost (MSE) funtion. We nd that there is a geometri link between information

theoreti ost funtions estimated using Parzen windowing, and mean vetors in a Mer-

er kernel feature spae. This link is used to propose and implement dierent lassiers

based on the integrated squared error (ISE) divergene measure, operating impliitly in

a Merer kernel feature spae. We also apply spetral methods to implement the same

ISE lassiers working inapproximations of Merer kernel feature spaes. We investigate

the performane of the lassiers when we weight eah data pointwith the the inverse of

the probability density funtion at that point. We nd that the ISE lassiers working

impliitly in the Merer kernel feature spae performs similar to a Parzen window based

Bayes lassier. Usinga weighted inner-produt denition gives slightly better results for

some data sets, while for other data sets the lassiation rates are slightly worse. When

omparingtheresults between theimpliitISElassierusing unweighted datapointsand

theParzenwindowBayeslassier,someoftheresultsindiatethattheISElassierfavor

the lasses with highest entropy.

(4)

(5)

I would like to thank my supervisor in this thesis, Robert Jenssen, for introduing and

explainingmanynewfasinatingideasinpatternanalysisduringtheworkwiththisthesis.

(6)

(7)

1 Introdution 1

1.0.1 Quik summaryof ontent inthis thesis. . . 3

1.1 Denitionsand notation used . . . 4

1.2 Struture and literature . . . 4

I Theory 7 2 Pattern lassiation basis 9 2.1 The Bayes lassier . . . 9

2.2 Minimum probabilityof lassiation error for the Bayes lassier . . . 10

3 Density estimation 13 3.1 Parametri methods . . . 13

3.2 Non-parametrimethods . . . 14

3.2.1 The histogramdensity estimator . . . 14

3.2.2 The naive density estimator . . . 14

3.2.3 The Parzen windowdensity estimator. . . 15

3.2.4 The multivariateParzen window . . . 16

4 Information theoreti learning priniples 19 4.1 Informationtheory . . . 19

4.2 Aninformation theoreti learningmahine . . . 20

4.3 Informationtheoreti quantities . . . 22

4.3.1 Entropy . . . 22

4.3.2 Divergene . . . 29

4.4 Estimationof informationtheoreti ost funtions . . . 32

4.4.1 Renyi quadratientropy estimate . . . 33

4.4.2 ISE divergene estimate . . . 34

4.4.3 Cauhy-Shwarz divergene estimate . . . 34

4.4.4 Usingnon-Gaussian kernels in estimation of ost funtions . . . 35

4.4.5 Shannon entropy estimate . . . 35

(8)

5.2 Informationmeasures in the Merer kernel spae . . . 38

5.2.1 ISE divergene . . . 39

5.2.2 Cauhy-Shwarz divergene . . . 40

5.3 The standard ISE lassier . . . 42

5.3.1 Connetionsto Parzen windowBayes lassier . . . 43

5.3.2 Multilassstandard ISE lassier . . . 44

5.3.3 UsingNon-Merer kernels . . . 45

5.3.4 Connetionto the graph ut . . . 45

6 A Laplaian ISE lassier 47 6.1 Modied ISE divergene . . . 48

6.2 Connetionto the Bayes probability of error . . . 48

6.3 Kernelspae and Laplaian matrix representation . . . 49

6.3.1 Illustrationof weights . . . 52

7 Spetral ISE lassiers 55 7.1 Mappingof data to aMerer based feature spae . . . 56

7.1.1 Illustrationof mappings to Merer spae . . . 57

7.2 Spetral versions of the ISE lassiers . . . 59

7.2.1 The spetral ISE lassier . . . 59

7.2.2 The spetral LaplaianISE lassier . . . 59

II Analysis and experiments 61 8 Kernel seletion 63 8.1 Eet of kernel bandwidth and kernel type . . . 63

9 Classiation experiments 69 9.1 Introdution . . . 69

9.2 Seletion of data sets and lassiation methods . . . 69

9.3 Standard ISE . . . 70

9.3.1 Confusionmatries Standard ISE and Bayes . . . 72

9.3.2 Summaryof lassiation results for the standard ISE lassier . . 76

9.4 LaplaianISE . . . 77

9.4.1 Summaryof lassiation results for the LaplaianISE lassier . . 84

9.5 Spetral ISE . . . 84

9.5.1 Summaryof lassiation results for the spetral ISE lassier . . . 89

9.6 Spetral LaplaianISE . . . 90

9.6.1 SummaryoflassiationresultsforthespetralLaplaianISElas-

(9)

10 Conlusion 97

10.1 Furtherwork . . . 97

Appendix 99

A Laplaian ISE lassier 99

A.1 Averageonfusion matriesLaplaian ISE lassier . . . 99

B Spetral ISE lassier 101

B.1 Averageonfusion matriesSpetral ISE . . . 102

C Spetral Laplaian ISE lassier 105

C.1 Averageonfusion matriesfor the spetralLaplaian ISElassier. . . 106

Bibliography 109

(10)

(11)

Introdution

Patterns are used to desribe any relations, regularities or struture inherent in a data

set generated by a soure. Similar patterns are often grouped into a lass. By deteting

signiant patterns in the available data, a learning mahine an make preditions about

new data oming fromthe same soure [26℄. In patternlassiation we use some labeled

trainingdataset, whereeahlabelrepresent alass,and usealearningmahinetopredit

the orret lass label for a new unlabeled sample. A learning mahine may be viewed

as a devie that adjust a set of parameters through a learning proess. For eah new set

of training data given to the mahine, the parameters are updated using a riteria that

aptures thewanted informationtodesribethedata inanewform. Patternlassiation

isone ofthe fundamentalproblemsinmahine learningand signalproessing[29℄, [26℄,[4℄.

Itisanimportantpartinomputervision,medialimaging,optialharater reognition,

geostatistis, handwriting reognition, biometriidentiation, natural language proess-

ing,doumentlassiation,emailspamdetetionandreditsoring,tolistafewexamples

[29℄, [4℄, [23℄,[1℄, [15℄, [8℄.

Information theoreti learning (ITL) methods emerged in [18℄ and [5℄. Information

theoreti learning here refers to the use of a general learning mahine as we desribe

in Setion 4.2, where a riteria related to Renyi's quadrati entropy is used to update

the learning proess. The riteria often use Renyi's quadrati entropy to nd divergene

measures between data in dierent distributions. A divergene measure an be thought

of as a generalization of algebrai distane measures (suh as the Eulidean norm) to

probability distributionspaes [5℄. In [18℄ two important informationtheoreti divergene

measures were presented, one is based on the Cauhy-Shwarz (CS) inequality and the

other onthe integrated squared error (ISE)between two probability distributions.

The Renyi's quadrati entropy is estimated using a Parzen window density estimation

method, and may be thought of as a generalization of variane to proesses with non-

Gaussiandistributions[5℄. In[18℄and [5℄informationtheoreti oneptsare explainedand

used in time series predition, independent omponent analysis (ICA), feature extration

and blind soure deonvolution.

Independent of ITL, a number of kernel methods have emerged in the reent years.

Kernel methods generally solve mahine learning problems in two parts: A module, also

(12)

known as a kernel funtion, performs a mapping of data to a new feature spae. In this

newfeaturespaealearningalgorithmisusedtodisoverlinearpatterns [26℄. Theuse ofa

kernelfuntionallows ustooperateimpliitlyinapossiblehighdimensionalspaethrough

evaluations of inner-produts. This is also known as the kernel trik. The advantage

of operating in high dimensional spaes is that the probability that the data is linearly

separable inreaseswiththe numberofdimensions weoperatein[2℄. Kernelmethodshave

been used suessfully in various elds of mahine learning, and inlude algorithms suh

assupportvetormahines(SVMs), kernelFisherdisriminantanalysis (KFD)and kernel

prinipalomponent analysis (KPCA) [17℄.

The anity matrixof adata set is alulatedwith the pairwiseinner-produts ofdata

samples with a kernel funtion, Element

(i, j)

^of ^this ^matrix ^ontains ^the inner-produt between data sample

i

^and

j

^, ^omputed^with ^a^kernel ^funtion. ^The inner-produtsof the anity matrix may be weighted suh that eah data point is multipliedwith the inverse

overallprobabilitydensityfuntionatthatpoint. Thisanitymatrixwithweightedinner-

produts is referred to as the Laplaian matrix. Spetral methods based on the spetral

properties, i.e. the eigenvetors and eigenvalues of the anity matrix or the Laplaian

matrix, havebeen popular inreent lustering appliations [5℄,[9℄.

It has been shown [13℄ and [9℄, that when using the non-parametri Parzen window

density estimatorwitha Mererkernel funtiontoestimatethe Renyi'squadrati entropy

for adata set, the result may be interpreted in termsof a mean vetorin a Merer kernel

feature spae. By measuring distanes between lass mean vetors in a Merer kernel

feature spae we an reate dierent information theoreti divergene measures between

lass distributions. TheCS divergene measure isshownin[13℄toberelatedtomeasuring

angles between lass mean vetors in a Merer kernel feature spae and have interesting

onnetionstograph theory. The ISEdivergene measure isina similarmannershown to

be relatedto the Eulidean distane between mean vetors ina Merer kernel spae [9℄.

The link between ITL and Merer kernel methods has been used to develop reent

lassier [12℄, [9℄, and lustering [9℄, [11℄, [10℄ algorithms based on the CS divergene

measure. The lassier proposed in [12℄ works impliitlyina Merer kernel feature spae,

whiletheCSbasedspetrallusteringalgorithmsworkinapproximateMererkernelspaes

spanned by the prinipal eigenvetors of the Laplaian matrix. These methods have all

usedweightedinner-produtkernelfuntionswhenalulatingtheCSdivergenemeasure.

This thesis is inspired by the use of the CS divergene measure in both lassiation

and lustering appliations. We provide neessary bakground on information theoreti

learning onepts to understand newly disovered, important relations between Merer

kernel theory, information theoreti measures and density estimation. In partiular we

fousonthepropertiesoftheISEinformationtheoretidivergenemeasureanduseMerer

kernel properties and geometri properties of this measure to argue that it may be used

as a ost funtion in an information theoreti lassier. Previous information theoreti

lassiers havetoour knowledgeonlybeenimplemented usingthe weighted CS divergene

measure. Thus we aim to use the ISE divergene measure in a similar manner. We

investigate performane and properties of lassiers based on the ISE divergene using

(13)

through evaluations with Merer kernels.

WealsoinvestigatespetralversionsoftheISElassiers,whereweeigendeomposethe

Laplaiandata matrix or the anity matrix. None of the ISE divergene based lassiers

presented inthis thesis havebeen presented before, sowe ouldnot know howthey would

perform. Wehoose toompare theresultsobtainedwithourimplementationoftheBayes

lassier, both beause it is a well known lassier and beause we nd out that the ISE

based lassierrule may beexpressed in avery similarway tothe Bayes lassierrule.

We present two dierent versions of the ISE lassier operating impliitly in dierent

Mererspaes. ThestandardISEandtheLaplaianISElassier,operatingonunweighted

and weighted trainingdata, respetively. We alsodevelop spetral versions of these las-

siers, the spetral ISE lassier and the spetral Laplaian ISE lassier, working in

the approximated Merer spaes spanned by the prinipal eigenvetors of the anity and

Laplaianmatries, respetively. The ISE based lassiers working impliitly ina Merer

kernelspaeingeneralseemstogivebestresults. ForsomedatasetstheLaplaianindued

weights improve the lassiation rates, but in others it redues or does not hange the

lassiation rates signiantly.

1.0.1 Quik summary of ontent in this thesis.

•

^W^e^provide^neessary^bakgroundinformationabouttherelativelynewoneptsused ininformationtheoreti learning. We alsoreview bakground informationneessary

tounderstand basi density estimation and patternlassiation.

•

^Wê întrodueând investigatenew lassiers based on the informationtheoreti ISE divergene measure and kernel methods, using both weighted and unweighted data.

•

^W^e investigate relations between an ISE divergene based lassier operating impliitly in a Merer kernel spae and the well known Parzen window based Bayes

lassier. Wendthat usingunweighted data theISE lassierisomparabletothe

Bayes lassierwith slightly dierent properties.

•

^Wê ûse ^the ^spetral ^properties ôf ^the ânity ând ^Laplaian^matries ôf ^the ^data, ^to

proposeandinvestigateISEbasedlassiersworkingdiretlyinapproximatedMerer

kernel spaes. We note that in most ases the spetral versions of the ISE lassier

perform slightly worse than the versions working impliitly inMerer spaes.

(14)

1.1 Denitions and notation used

x,y A training patternand lass label

i,N Counter and number of patterns

i.i.d Independent identiallydistributed

pdf Probability density funtion

ISE Integrated Squared error

MISE Mean Integrated Squared Error

AMISE Asymptoti Mean Integrated Squared Error

IP Information Potential

ITL Information Theoreti Learning

W _σ ² ( · , · )

^A ^Gaussian ^kernel, ^with ^bandwidth

σ ² K h ( · , · )

^A ^general ^Merer ^kernel, ^with ^bandwidth

h F

^A ^Merer ^kernel ^feature ^spae

d

Dimensionality of data

C

^Number^of ^lasses

Table 1.1: Denitions and abbreviations

Density funtions are usually referened with small letters e.g.

p(x)

^. Probabilities are referened with large letters e.g.

P (x)

^. Întegrals ^with ^no ^limits âre âssumed ^to ^be ^with

lowerlimit

−∞

^and^upper^limit

∞

^. ^The^sample

x

^may^take^any ^valueⁱⁿ^the

d

dimensional spae, unless otherwise speied. Expetations with regard to a variable or funtion

f

^is

denoted by

E f {·}

îf ît ^may ^beûnlear ^what ^we âlulate^the êxpeted ^value ^with ^respet

to.

1.2 Struture and literature

Struture of this thesis This thesisis divided inthree parts. In Part I wepresent the

theory neessary to understand and implement our lassiers. Part II ontains analysis

and experiments done to hek how the theory works onsome populardata sets. In Part

III we onlude the thesis and suggests further work that may be done.

•

^Chapter ²întrodue ^the^basi ônepts ôf^patternlassiation. TheBayeslassier is used as an example and shown to give a minimum probability of lassiation

error.

•

^Chapter ³^disusses ^various^methods^for ^density êstimation^with ânêmphasis ôn^the

Parzen windowdensity estimatorand itsproperties.

•

^Chapter ⁴ êxplains ^the ônept ôf ân information theoreti learning mahine. A detailed disussion of various information theoreti riteria is given. Some sample

(15)

•

^Chapter ⁵ êxplain ^the ^kernel ^trik ând ^how ît ân ^be ûsed ^to êxpress information theoreti measures inaMererkernel spaefromasimple geometriviewpoint. The

standard ISE lassieris developed using this geometri view of the ISE divergene

between two distributions. This lassier is shown in a speial ase to be equal to

the familiarBayeslassierusing density estimates. Somerelationsbetween theISE

divergene measure and graph theory are disussed.

•

^Chapter ⁶ ^presents ^the ^Laplaian ÎSE ^lassier, â ^modied ^version ôf ^the ^standard

ISE lassierusing weighted data samples.

•

^Chapter ⁷ ^presents ^spetral ^versions ôf ^the ^standard ÎSE ând ^Laplaian ^lassier,

working in approximated Merer kernel spaes.

•

^Chapter ⁸ ^provides â ^short ânalysis ôf ^the êet ôf ^dierent ^kernel ^sizes ând ^kernel

typesused in the previously presented lassiers.

•

^Chapter ⁹^presents ând ^disusses ^results^found ûsing ^the^dierent^lassiers ôn^some

popular datasets. Wetry toillustratethe eets of weighting the data samplesand

howdata distributes inthe approximate Merer kernel spaes.

•

^Chapter ¹⁰ ^onludes ^this ^thesis ^and ^suggests ^some ^work ^that ^may ^be ^done ⁱⁿ ^the

future.

•

^The âppendix ôntains ^some âdditional^lassiation^results ^not ^listedⁱⁿ ^Chapter ^9.

Literature Information theory in general is overed by [3℄ and the artile whih rst

used entropy as a measure in ommuniation, [25℄. The priniples behind information

theoretilearningwasmainlyintroduedin[18℄ withannieoverview in[5℄and[9℄. Good

introdutionbookstopatternlassiationmethodsare[29℄,[4℄and[8℄. TheCSdivergene

based lassier whih inspiredmuh of the work with our ISE divergene based lassiers

is presented in [12℄. Important relationsbetween Merer kernel funtions,Parzen window

density estimators, CS divergene and graphtheory isreviewed in[13℄. Asurvey of kernel

methods used in pattern analysis is given in [26℄ and [17℄. Spetral methodsused in this

thesis are inuened by materialin[26℄ [9℄, [24℄, [6℄and [31℄.

(16)

(17)

Theory

(18)

(19)

Pattern lassiation basis

This hapter gives a denition of pattern lassiation. It also reviews the well known

Bayes lassiation rule, whih is shown to give a minimum probability of lassiation

error. TheBayeslassiermaythus beused asabenhmark whentesting otherlassiers

withregardtoerrorrates. InPartII,Chapter 9,weuseanimplementationofthislassier

toompare withthe new lassiers whih willbepresented inlater haptersof thisthesis.

Denition The task of lassiation is to nd a rule, whih based on observations of

trainingpatterns assigns anunlassied pattern toone of several possible lasses. A las-

siationrule with two dierent lasses isto estimatea funtion

g : R ^d −→ {− 1, 1 } ,

usinginput-outputtrainingdatapairsgeneratedi.i.daordingtoanunknown probability

distribution

p(x, y ), (x 1 , y 1 ), . . . , (x N , y N ) ∈ R

^d

× Y, Y {− 1, 1 }

suhthat

g

^will^orretly^lassify^a ^new ^sample

x

^[17℄. ^A ^sample

x

^is^assigned^to^the ^lass

labeled

+1

^if

g(x) ≥ 0

^. ^The^sample

x

^is^assumed ^to^be^generated^from^the^same^pdf

p(x, y)

as the trainingdata.

2.1 The Bayes lassier

To explainmore indetail howwean dene alassierwe begin withthe Bayeslassier.

The reason for starting with this is that it an be shown to give an optimal result with

regard to minimum probability of lassiation error, under ertain onditions. This will

be proved in the next setion. It isalso well known and has some theoretial onnetions

tothe ISE lassier, whih willbeexplained inSetion 5.3.1.

Wewanttolassifyanunknownfeaturevetor

x

^to^one^of

C

^possible^lasses

ω 1 , . . . , ω C

in a way that assigns

x

^to ^the ^lass ^where ît's^most ^likely ^to ^belong. ^Wê ^dene ^what îs

(20)

most likely with the probabilities

P (ω i | x), i = 1, . . . , C

^, âlso ^known âs ^the â ^posteriori

probabilities. A possible lassiation rule isto assign

x

^to ^the ^lass

ω ^∗

^satisfying^[29℄

ω ^∗ = max

ω i

P (ω i | x), i = 1, . . . , C.

^(2.1)

Assume that the a priori probabilities

P (ω i ) i = 1, . . . , C

^, âre ^known. Îf ^they âre

unknown they anbeestimated fromour trainingdata as

P (ω i ) = ⁿ _N ⁱ

^where

N

^is^the ^total

number of training samples, and

n _i

^is ^the ^number ^of ^samples ^belonging ^to ^lass

ω _i

^. ^The

lass-onditional probability density funtions

p(x | ω i )

^, ^also ^known ^as ^likelihood ^funtions

of

ω i

^, ^with ^respet ^to

x

^, âre âlsoâssumed ^to^be ^known ^[29℄. Îf^these âre ûnknown ^we ^will

see later how they an be estimated.

1

To alulate the a posteriori probabilities we an

use Bayesrule [29℄

P (ω i | x) = p(x | ω i )P (ω i )

p(x) , i = 1, . . . , C,

^(2.2)

where

p(x) = X

i

p(x | ω i )P (ω i ).

Sine

p(x)

îsâ ômmon^fator ^forâll ^lasses ît^may ^be îgnored ând ^weân ûse

ω ^∗ = max

ω i

p(x | ω i )P (ω i ), i = 1, . . . , C,

^(2.3)

as our lassier. If the apriori probabilities

P (ω _i )

^, âre êqual Êq.^(2.3) ^redues ^to

ω ^∗ = max

ω i

p(x | ω i ), i = 1, . . . , C,

andthesearhforthemostprobablelassforfeature

x

^redues^to^evaluating^the^onditional

pdfs at

x

^. ^Sine ^the ^lassier ⁱⁿÊq. ^(2.3) îs ôbtained ûsing ^Bayes ^rule, îtîs ôften^referred

toas the Bayeslassier. Inthe two lass ase we use Eq. (2.3) to assign

x

^to ^lass

ω 1

^if

p(x | ω 1 )P (ω 1 ) > p(x | ω 2 )P (ω 2 ).

Our lassiation funtionan nowbe dened asto lassify

x

^to^lass

ω 1

^if

g(x) = p(x | ω 1 )P (ω 1 ) − p(x | ω 2 )P (ω 2 ) ≥ 0

2.2 Minimum probability of lassiation error for the

Bayes lassier

InFig.2.1,theBayesianlassierfortwoequiprobablelassesforaone-dimensionalfeature

vetor

x

^is illustrated. The region to the left of the dotted threshold linelearly ontains most of

p(x, ω ₁ ) = p(x | ω ₁ )P (ω ₁ )

^and ^we ^dene ^this ^as

R ₁

^, ^and ^the ^region ^to ^the ^right

1

If

x

^is^disrete^the^likelihood ^funtions^beomeprobabilitiesandaredenoted with

P ( x| ω _i )

(21)

Figure 2.1: Illustration of two deision regions. Figure borrowed from [29℄

of the threshold as

R 2

^. ^Let

R 1

^and

R 2

^be ^the ^regions ^where ^we ^lassify

x

^to

ω 1

^and

ω 2

^,

respetively. The total area overed by

p(x | ω ₁ )

ⁱⁿ^region

R ₂

^and

p(x | ω ₂ )

ⁱⁿ

R ₁

^will^be^the

probabilityofausinglassiationerrors. Ifthethresholdismovedtotheleftorright,this

areaand probabilitywillinrease. Thismeans thatif wewant tominimizethe probability

of an error, the deision regions

R ₁

^and

R ₂

^must ^be ^seleted ^by ^moving ^the ^threshold ^so

this area is assmallas possible.

In a multilass situationwith a multidimensionalfeature vetor

x

^we ^have

C

^dierent

lasses with deision regions

(R 1 , . . . R C )

^our ^feature ^vetor

x

ân ^be ^plaed în. ^Wê ^now

generalize the situation in Fig. 2.1. Writing the probability of a orret deision by the

joint probability

P (x ∈ R _i , ω _i ),

then the probability of erroneously assigning

x

^to

ω j

^by ^not ^seleting^the ^orret ^lass

ω i

is

P (x ∈ R j , ω i ), ∀ j 6 = i.

(22)

The total probability for ommittingan error inlassiation isthus

P e = [

∀j6=i

P (x ∈ R j , ω i ),

= X

∀j6=i

P (x ∈ R j | ω i )P (ω i ),

= X

∀j6=i

P (ω i ) Z

R j

p(x | ω i )dx,

= X

∀j6=i

Z

R j

P (ω i | x)p(x)dx.

The total probability for orretlassiation is

X

i

Z

R i

P (ω i | x)p(x)dx = 1 − P e .

Hene

P e = 1 − X

∀i

Z

R i

P (ω i | x)p(x)dx.

The error is learly minimizedwhen the regions

R _i

^are ^seleted ⁱⁿ ^a^way ^where

R i : P (ω i | x) > P (ω j | x), ∀ j 6 = i,

whihis the same as Eq. (2.1).

(23)

Density estimation

Theostfuntionsusedbythelassiersdisussedinthisthesisarealldependentonnding

somesortofdensityestimateforthelassdistributionsofthedata. Theasewhereweknow

the distributions of the feature vetors in eah lass

ω _i

^, ^given ^by ^the ^likelihood ^funtions

p(x | ω i )

^, ^is unfortunately not the reality for most data sets. We have to nd estimates of these distributions. There are basially two ategories of methods for estimation of

pdfs, parametri and non-parametri methods. In this setion, for ompleteness, a short

desription of parametri methods for density estimation is given. For more details [29℄

is reommended. The non-parametri methods are far more important in information

theoreti lassiation, sine we often annot assume that the data set has a parametri

distributionshape. Inpartiular theParzenwindowmethodfor density estimationwillbe

investigated. Throughoutthishapterweassumeadatasetof

N

^samples,

x i , i = 1, . . . , N

^,

generated i.i.dfrom unknown distributions,unless otherwise is speied.

3.1 Parametri methods

Assume that the pdf to be estimated is desribed in parametri form by some unknown

parameter vetor

θ

^, ^so ît ân ^be ^written âs

f (x; θ)

^. ^We ^have ^a ^limited ^number

N

^of

i.i.dtrainingdata,

x 1 , . . . , x N

^available^from^our distribution. Usingthese sampleswean use dierent methods to nd an estimate of the parameters in

θ

^suh ^that ^the ^estimate

f ˆ (x; θ)

îs âs ^lose âs ^possible ^to ^the ^true ^pdf. ^This ^means ^that ^we âssume ^that ôur ^data

is generated from a distribution with a shape that is lose to a parametri form, e.g we

assume a Gaussian, Rayleigh or some other well known distributionand try to adjust its

parameters to t our data as good as possible. The parameters in

θ

^are ^usually ^found

by maximum likelihood estimation. In [29℄ some methods for parametri estimation are

desribed, e.g. maximum likelihoodestimation, mixture models, maximum entropy et.

(24)

3.2 Non-parametri methods

To avoid the need to make assumptions about a parametri shape of the desired distri-

bution, we must often use non-parametri methods. In this setion, we review dierent

methodsofdensityestimationwithaone-dimensionalrandomvariable

x

^taken^from^a^on-

tinuous,univariatedensity funtion

f(x)

^. ^We^start^with^the ^simple^histogram^method^and

expandituntilweendup withthe Parzenwindowestimator. Someof themostimportant

properties of the Parzen window estimator are then disussed. In the last setion, the

Parzen windowmethodis expanded to the ase where we have multivariate distributions,

where the variable

x

^is ^amultidimensionalvetor.

3.2.1 The histogram density estimator

The oldest way to nd a non-parametri estimate of a funtion is given by the histogram

[30℄, [9℄. Given an origin

x 0

^, ^and ^a ^bin ^width

h

^, ^the ^bins ôf ^the ^histogram âre ^dened âs

[x 0 + mh, x 0 + (m + 1)h)

^for ^positive ^and ^negative ^integers

m

^. ^The ^histogram^estimate^of

the funtion

f (x)

^is ^then

f ˆ (x) = 1

Nh (

^no^of

x i

ⁱⁿ^same ^bin ^as

x).

^(3.1)

Thisestimatorisobviouslydisontinuousand notusablewhen weneedtondderivatives.

3.2.2 The naive density estimator

This is alsoa variantof the histogrammethod. Dene the pdf evaluatedat

x

^as

f (x) = lim

h→0

1 2h P (x − h < X < x + h)

^(3.2)

The probability

P (x − h < X < x + h)

ân ^be êstimated ^by ôunting ^the ^numberôf ^data

samplesfallingintoabinof size

2h

^entered ^at

x

^. ^This ^an^be^dened ^more^preisely^with

a weight funtion

W (x) = ₁

2

^if

| x | ≤ 1 0

^otherwise

,

suh that the naive estimator an beexpressed as [9℄

f ˆ (x) = 1 Nh

X N i=1

W

x − x i

h

.

^(3.3)

Introduinga resalingnotation

W h (u) = h ⁻¹ W (u/h)

^we^rewrite ^Eq. ^(3.3) ^as

f(x) = ˆ 1

N X N

i=1

W h (x − x i ).

^(3.4)

From Eq.(3.3) wesee that anestimateforthe pdf at

x

îs ^given ^by ^plaingâ^box âround

eah sample

x i

^with ^width

2h

^and ^height

(2Nh) ⁻¹

ând ^sum ûp. ^This êstimator îs ^not

(25)

3.2.3 The Parzen window density estimator

Parzen generalized the weight funtion

W _h ( · )

ⁱⁿ Êq. ^(3.4) ^toâ ^kernel ^funtion ôr ^Parzen

window whihis a funtionsatisfying [29℄

K h (x) ≥ 0

^and

Z

x

K h (x)dx = 1.

Thesubsript

h

^refers ^to^the^bandwidth ôr^window^width ôf ^the^kernel^[30℄. Ûsually

K( · )

^is

hosen tobe aunimodalprobability density funtion that issymmetriaroundzero. This

makes sure that the estimator

f(x) = ˆ 1 N

X N i=1

K h (x − x i )

^(3.5)

produes anestimatewhih isalsoa density.

Toinvestigatesomeofitsproperties,wendexpressionsforthemeanvalueandvariane

of Eq. (3.5). Let

f(x) ˆ

^be ^the ^estimate ^of ^the ^true ^density

f (x)

^at

x

^, ^with

x ^′

^a ^random

variable with density

f (x)

^. ^Then

E { f ˆ (x) } =E { K h (x − x ^′ ) }

= Z

K h (x − x ^′ )f(x ^′ )dx ^′

=K h (x) ⋆ f (x).

^(3.6)

The density estimate is therefore a smoothed version of the true density. The bias of the

estimatoris given by [30℄

E { f ˆ (x) } − f (x) = [K h (x) ⋆ f (x)] − f (x),

^(3.7)

and the variane is [30℄

V ar { f ˆ (x) } =E { [ ˆ f(x) − E { f ˆ (x) } ] ² }

= 1

N { (K _h ² (x) ⋆ f (x)) − [(K h (x) ⋆ f (x)] ² } .

^(3.8)

It is ommonto measure the loseness of the estimator

f ˆ (x)

^to^the ^target ^density

f(x)

ⁱⁿ

the point

x

^by ^the ^size ^of ^the ^mean ^squared ^error ^(MSE)

MSE { f(x) ˆ } = E n

[ ˆ f (x) − f(x)] ² o ,

whihan bewritten as

MSE { f ˆ (x) } = 1

N { (K h (x) ² ⋆ f (x)) − [(K h (x) ⋆ f (x))] ² } + { (K h (x) ⋆ f (x)) − f(x) } ²

=V ar { f(x) ˆ } + [Bias { f ˆ (x) } ] ² .

(26)

Instead of just estimating the funtion

f (x)

ât â ^single ^point ^we ^want ^to êstimate ît ôver

the whole

x

^-spae. ^The ^mean întegrated êrror ^(MISE) îs â ^more appropriate measure for analyzing

f ˆ (x)

^where

MISE { f(x) ˆ } = Z

MSE { f ˆ (x) } dx

= Z

E { [ ˆ f(x) − f (x)] ² } dx + Z

V ar { f ˆ (x) } dx.

^(3.9)

The biasand varianeterm inEq.(3.9)depend onthe kernel width

h

ⁱⁿ^dierent^ways. ^It

has been shown that Eq. (3.9) for large sample sizes

N

^, ^the ^asymptoti ^mean ^integrated

squared error (AMISE) isgiven by [14℄,[30℄

AMISE { f(x) ˆ } = (Nh) ⁻¹ R(K ) + 1

4 h ⁴ µ 2 (K) ² R f ^′′

(3.10)

where

µ 2 (K) = R

z ² K(z)dz

^,

R(f ^′′ ) = R

{ f ^′′ (x) } ² dx

^,

f ^′′ (x) = _d ^d 2 ² x f (x)

^and

R(K) = R K(z) ² dz

^. ^We^see^that^minimizing^the^left^term^(the^variane)^with^a^large^kernel ^window

h

^results ⁱⁿ â ^huge înrease ⁱⁿ ^the ^bias ^part ^whih îs proportional to

h ⁴

^. ^This ^is ^what ^is

known as the variane-bias trade-o in kernel size seletion. There exists many ways to

nd the kernel size

h

^. ^W^e ^mention ^two ^popular ^methods ^here. ^The ^rst dierentiates Eq. (3.10) and equates itto zero, obtaining

h AM ISE =

R(K) µ ² ₂ (K)R(f ^′′ )N

¹ ₅

The other method estimates

R(f ^′′ )

^by âssuming ^that ^the ^true ûnderlying ^density îs â

normal density. Then the kernel size is given by [9℄

h AM ISE = 1.06N ⁻ ¹ ⁵

Several other methodsexist, see [14℄ and [30℄.

3.2.4 The multivariate Parzen window

TheextensionoftheParzenwindowtofeaturedataina

d

-dimensionalspaeisalittlemore diult. The sparseness of data in higher dimensional spaes makes the estimation more

diult, unless we have very many samples. This phenomenon is usually referred to as

the urse of dimensionality. Rememberingthat the kernel funtioninthe one-dimensional

ase speify the window width,this window willinthe multidimensionalase be replaed

with hyperubesand eah dimensionof the ube requiresa parameter tobe estimated for

the kernel. A diret extension of the univariate kernel estimate in Eq. (3.5), is obtained

by replaing the point

x

^with ^a vetor-point

x ∈ R ^d

^and ^the ^variable

x i

^with ^a

d

^-variate

sample

x _i

^with ^density

f(x)

^. ^The ^Parzen ^estimator^beomes ^[30℄

f(x) = ˆ 1 N

X N i=1

K ^H (x − x _i )

^(3.11)

(27)

where

H

^is ^a ^symmetri^positive^denite

d × d

^matrix ^alled ^the ^bandwidth^matrix

K ^H (x) = | H | ^−1/2 K(H ^−1/2 x).

With further restritions on

H

^, ^see ^[30℄ ^for ^details, ^we ^get ^the ^single ^bandwidth ^kernel

estimator

f(x) = ˆ 1 Nh ^−d

X N i=1

K { (x − x _i )/h } .

^(3.12)

There exists several methodstogivean estimateofthe optimalkernelsize for amulti-

variatedataset. The optimalkernelsizeisusuallyseleted tominimizetheMISE between

f ˆ (x)

^and ^the ^target ^density

f (x)

^. ^The ^normal ^referene ^rule ^for ^the ^MISE ^kernel ^size ^is

given by Silverman'srule [28℄, [9℄

ˆ h = σ x

4 (2d + 1)N

_d+4 ¹

,

^(3.13)

where

σ _x ² = d ⁻¹ P

i Σ ii

^and

Σ ii

âre ^the ^diagonalêlementsôf ^the ^sample ôvariane^matrix.

Due to the urse of dimensionality this method is not regarded as reliable for higher di-

mensionaldata. In this thesis most of the data sets are of higher dimensions, so we have

hosen a ross validationtehnique to nd the best kernel sizes inthe density estimates.

(28)

Some well-known Parzen windows with

u = ^x−x _h ⁱ

^are ^listed ^below, ^where ^for

u ≥ 1

^all

windows evaluate to zero, exept the Gaussian kernel. Forsimpliity we only present the

one-dimensional versions,but they an easilybeextended to multivariateversions.

•

^Uniform

K h (u) = 1/2.

•

Epanehnikov

K h (u) = 3

4 (1 − u ² ).

•

^Gaussian

W _σ ² (x) = 1

√ 2πσ ² exp (

− (x − x _i ) ² 2σ ²

) .

•

^Quarti

K h (u) = 15

16 (1 − u ² ) ² .

•

^T^riweight

K h (u) = 35

32 (1 − u ² ) ³ .

•

^Cosinus

K h (u) = π

4 cos π 4 u

.

(29)

Information theoreti learning priniples

This hapter starts with a brief introdution of information theory. An overview of the

onepts inaninformationtheoreti learningmahineisthen given. Informationtheoreti

riteria,whihgivesusthetoolstomeasuretheshapeof,anddistanebetween,probability

distributions,is then explainedindetail. Inthe lastsetion wedisuss some ost funtion

estimators used in ITL.

4.1 Information theory

Information theory is in this thesis related to C.E Shannon's report from 1948, A math-

ematial theory of ommuniation [25℄. Shannon dened a measure of information or

unertainty assoiated with a stohasti experiment and named it entropy. This measure

was used toanswer importantquestionsin ommuniation. Shannon used entropy tond

alimittohowmuhinformationanbetransferred overanoisyhannel,and tond ways

todesign optimalodes for data ompression.

Entropy an bethoughtof asthe unertaintyassoiated withthe value ofa realization

ofasingle randomvariable. Itisameasure onhowmuhinformationthatisgainedabout

the ontent of astohasti random variable aftera stohasti experiment.

Shannon also dened a measure alled mutual information, whih is the amount of

information that one random variable arries about another, i.e. the redution in the

unertaintyofone randomvariableduetotheknowledgeofthe other. Mutualinformation

is a speial ase of a more general quantity, alled relative entropy. The relative entropy

or divergene an be used as a measure of distane between two distributions. It is a

measure of the ineieny of assuming that a distribution is given by a density funtion

q( · )

^,^when^itⁱⁿ^fat^has ^adistributiongiven byadensityfuntion

p( · )

^. ^This^is^also^refereed

toas adivergene measure between distributions. In this thesis we use dierent estimates

of entropy and divergene asinformationtheoreti measures.

(30)

Figure 4.1: A general learning mahine.

4.2 An information theoreti learning mahine

A very general learningmahine may be desribed using a modellikethe one in Fig. 4.1.

Mahine learning is in general divided into supervised, semi-supervised and unsupervised

tasks. The model in Fig. 4.1 an be used to desribe all of the above mahine learning

tasks. Wehave some input data

X

^,^ontaining informationor measurements from areal- world event. We want the learning mahine to perform some spei task on

X

^. ^This ^is

done by givingthe inputdata

X

^to ^a^possibly ^non-linear ^parametri ^mapping^funtion

g : R ^d → R ^M ,

^(4.1)

whihtransforms the input vetor

X ∈ R ^d

^to

Y ∈ R ^M

Y = g (X, W ),

^(4.2)

where

W

âre^the^parametersôf^the^mapping^funtion. Îf^theôptimality^riterionîs^basedôn

aninformationtheoretimeasure, eitherentropyordivergene, weallthis aninformation

theoreti learningmahine. The mapperfuntionin Eq. (4.2)transform theinput data to

a new formdepending onthe task of the learningmahine. The outputof the mapper,

Y

isompared with anoptimality riterionand optionallyadesired response

z

^for ^the^input

X

^. ^F^or ^eah presentation of training data the optimality riterion is evaluated and the error term

e

îs^fed ^to ânâdaptation âlgorithm^whih ûpdate ^the ^parameters

W

^.

Supervisedlearningonerns alearningmahinewith adesiredresponse foreahinput

X

^. ^This ^thesis ^fous ôn ^supervised ^learning, ^speially ôn ^lassiation ôf ^data. Îf ^the

learningtask is lassiation,the desiredresponse

z

^for ^the ^training ^data ^ontains ^a ^lass

(31)

label. Thetaskisthentolassifyarandominputsampletooneof

C

^lasses. ^The^mapping

funtion may inthis ase bedesribed as

g : R ^d → { y 1 , . . . , y c } ,

where

y 1 , . . . , y c

^are^the^possible^lass^labels, ^and

M = 1

ⁱⁿÊq.^(4.1). ^Regressionîsânother

example of atypialsupervised learningtask, where the mappingis

g : R ^d → R .

Semi-supervised learningtasksonerns learningwhere thelabelsofthe inputdataare

only partially known. One example is ranking where we only have available the relative

orderingofthetheexamplesinthetrainingset,whileouraimistoenableasimilarordering

of noveldata.

Unsupervised tasksare learningtaskswhere thewanted informationfromthe datahas

tobeextrated withoutany desiredresponse inthe optimalityriterion. Clustering isone

typialexampleofunsupervisedlearning,wheretheaimistondanaturaldivisionofdata

into homogeneous groups [26℄. Anomaly and novelty detetion are other examples, where

the taskistodetetsamplesthatdeviate fromthenormal. Other importantunsupervised

tasks are nding low-dimensional representations of the input data, important examples

ofthis isprinipal omponentanalysis (PCA)and independentomponent analysis (ICA).

In PCA, the mapping inEq. (4.1) aims to projet the input

X

^to^a ^lower

M

-dimensional spae, where

M

^denotes ^the ^number ^of unorrelated features in

X

^. În ÎCA ^the ^goal îs ^to

projet

X

^to ^a ^lower

M

-dimensional spae where eah of the features of

X

^are ^mutually

independent.

Traditionallythe riteria foroptimalityinFig. 4.1has been tominimizethe MSEost

funtion between the output

Y

ôf ^the ^mapperând ^the ^desiredôutput

z J(Y ) = E

(z − Y ) ² .

^(4.3)

Inourgeneralmahinelearningmodelwewanttotransferasmuhinformationaspossible

aboutourdataintothemapperfuntion

g(X, W )

^,^suh^that^this^mapper^is^able^to^desribe

our data as aurately as possible. The optimality riterion is thus ritial in obtaining

the parameters

W

^. ^If ^we^use ^MSE, ^the informationtransferred fromthe measurements

X

andthe desiredresponses

z

^to^the^parameters

W

îs^purely^basedôn^seondôrder^statistis

onstraints. This is only optimal if the input data is drawn from Gaussian distributions,

whihis arather strit restrition.

To transfer as muh information as possible to the parameters

W

^the ^error ^term ⁱⁿ

Fig. 4.1 must be omputed with a riterion transferring as muh information as possible

about the input data

X

^, ^and ^the ^desired ^response

z

^, ^to ^the ^parameters

W

^of ^the ^mapper

Y = g (X, W )

^. Îf ^we ^base ôur âlulations ôn informationtheory and optimize with information theoreti riteria in Fig. 4.1 and Eq. (4.2) we have what Prinipe et al desribes

as information theoreti learning [18℄. The main advantage with information theoreti

riteriaare that they are funtions of probabilitydensities and apture alldata statistis,

not justthe seond-order statistis. This givesuslearningmahineswhere the parameters

W

^desribes ^our ^data ⁱⁿ^a ^muh ^better ^way ^than ^the traditionalMSEan.

(32)

4.3 Information theoreti quantities

In this setion we disuss some important information theoreti quantities that may be

used as informationtheoreti riteriain anITL mahine.

Denitions A disrete stohasti variable

X

îs âssoiated ^with â ^triple

(x, A ^X , P ^X )

^,

where the outome

x

îs ^the ^value ôf ^the ^stohasti ^variable ^whih ^takes ôn â ^set ôf

possible values

A ^X = { a ₁ , a ₂ , . . . , a _N }

^. ^These ^have probabilities (distribution)

P ^X = { p 1 , p 2 , . . . , p N }

^, ^where

P (x = a _i ) = p i

^,

p i ≥ 0

^and

P

a _i ∈A _X P (x = a _i ) = 1

^.

A ontinuous stohasti variable

X

îs âssoiated ^with â probability density funtion

f ^X (x),

^where^the^outome

x

îs^the^valueôf^the^stohasti^variable. ^The^pdfîs^denedâs^the

derivative of the umulativedistribution funtion (df), dened as

P (X ≤ x ₀ ) = F ^X (x 0 )

^,

where

0 ≤ F ^X (x) ≤ 1

^. ^Hene,

f ^X (x 0 ) = _∂ ^∂ x F ^X (x) | ^x = x ₀

^, ^and

R ∞

−∞ f ^X (x)dx = 1

^.

We have statistialindependene between random variables

X ₁ , . . . , X _N

îf ând ônly îf

f (x 1 , . . . , x _N ) = Q N

i=1 f(x i )

^.

A metri ona set

X

^is ^a ^funtion

u : X × X → R

^. ^F^or ^all ^x,y^,z ⁱⁿ

X

^, ^this ^funtion ^is

required tosatisfy the followingonditions

1.

u(x, y) ≥ 0

^.

2.

u(x, y) = 0

îf ând ônly îf ^x=y^.

3.

u(x, y) = u(y, x)

^.

4.

u(x, z) = u(x, y) + u(y, z)

^.

4.3.1 Entropy

All information theoreti riteria are related to the onept of entropy. We now explain

Shannon's measure of entropy and some of it's properties. A more general version of

entropy, the Renyi entropy is then reviewed.

Shannon's entropy

Assume there is some unertainty in the outome of an randomexperiment and that the

possible outomes of the experiment is given by a probability distribution. This uner-

tainty was rst quantied by Shannonas

H = H _N (p ₁ , p ₂ , . . . , p _N )

^satisfying^the ^following

riteria[25℄ [9℄

1.

H N (p 1 , p 2 , . . . , p N )

îs â^symmetri ^funtionôf îts^variables.

As anexample,

H N (p 1 , p 2 , . . . , p N ) = H N (p 2 , p 1 , . . . , p N ).

2.

H N (p 1 , p 2 , . . . , p N )

îs âôntinuous ^funtion ôf

p 1 , p 2 , ...., p N

^.

3.

H N ( _N ¹ , . . . , _N ¹ )

^attains ^the ^maximum ^value.

(33)

4.

H N +1 (tp 1 , (1 − t)p 1 , p 2 , . . . , p N ) = H N (p 1 , p 2 , . . . , p N ) + p 1 H 2 (t, 1 − t)

^for ^any ^distri-

bution

p X

^and

0 ≤ t ≤ 1

^.

The fourth property of Shannon entropy may be explained as follows [25℄. If a hoie

is broken down into two suessive hoies, the original entropy (

H

⁾ ^is ^the ^weighted ^sum

of the individual values of

H

^[25℄. ^This ^isillustrated in Fig. 4.2.

Figure 4.2: At the left we have three possibilities, eah hosen aording

to the probabilities

p 1 = ¹ ₂

^,

p 2 = ¹ ₃

^,

p 3 = ¹ ₆

^. ^On ^the ^right, ^we ^rst ^hoose

between two possibilities eah with probability

1

2

^. ^If ^the ^seond possibility is hosen, we make another hoie with probabilities

2 3

^and

1

3

^. ^The ^nal

results have the same probabilities as before. We require, in this speial

ase, that

H( ¹ ₂ , ¹ ₃ , ¹ ₆ ) = H( ¹ ₂ , ¹ ₂ ) + ¹ ₂ H( ² ₃ , ¹ ₃ ).

^The

¹ ₂

^oeient ^is ^beause

this seond hoie only ours half the time.

Shannon showed that the only

H

^satisfying^the ^above assumptions is[25℄

H _N (p ₁ , p ₂ , . . . , p _N ) = H _N ( P ^X ) = − K X

p i ∈P _X

p _i log _b p _i ,

^(4.4)

withtheonventionthat

0 log _b 0 = 0

^. ^This^measure^heâlledêntropy,^beauseîtîs^the^same

expression used to deneentropy in statistialmehanis.

K

^is ^some^onstant, ^depending

of the units of the sample data. If the base

b = 2

^, ^the êntropy ûnit îs^bits ând îf

b = e

^the

unitisnats. Inthis thesisweleavethebase

b

ôf^the^logarithmûnspeied,^sineîtîs^justâ

measurement sale. Entropy isusually denoted by

H(X)

^where

X

îs â^label ^forâ ^random

variable, and not the argument of afuntion. Shannon's entropy depend onthe quantity

I(p i ) = − log p i ,

^(4.5)

(34)

whihwas proposed by Hartley asa measure of the informationreeived by learningthat

a single event of probability

p i

^took ^plae ^[7℄. ^Hene ^the ^Shannon êntropy îs â ^weighted

average of informations

I(p _i )

H(X) = E { I(p i ) } .

^(4.6)

Properties of Shannon entropy

SeveralpropertiesoftheShannonentropyanbederivedbasedonthefourbasiproperties

[9℄[25℄ [3℄

1. Addingorremovinganeventwithprobabilityzerodoesnotontributetotheentropy,

H N (p 1 , p 2 , . . . , p N , 0) = H N (p 1 , p 2 , . . . , p N ).

2. It vanishes when one outome is ertain,

H _N (p ₁ , p ₂ , . . . , p _N ) = 0

^,

p _i = 1

^,

p _j = 0, j 6 = i, i = 1, . . . , N

^.

3. The maximum of

H N

^inreases ^as

N

^inreases.

4.

H N ≥ 0

^.

Example ToillustratesomepropertiesofShannon entropyletthe stohastivariable

X

be given by

X =

1,

^with probability

p 0,

^with probability

1 − p

^.

The entropy inthis ase is

H(X) = − p log(p) − (1 − p) log(1 − p),

asshown inFig.4.3asafuntionof

p

^. ^Noteⁱⁿ^Fig.^4.3^that^the ^entropy ^is^zero^when

p = 0

or

p = 1

^,^meaning^thereîs^noûnertainty âbout^the ôutome ôf^the ^stohastiexperiment.

If

p = ¹ ₂

^the ûnertainty îs ^maximized, ând ^we ^need ôn âverage ¹ ^bit ^to ^transmit ^the

outome of the experiment.

(35)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.2 0.4 0.6 0.8 1

p

H(p)

Figure 4.3:

H(X) = H(p)

ⁱⁿ ^bits ^with ^Shannon^'s ^entropy, ^notie ^that

H(X) = 1

^, ^when

p = ¹ ₂

^.

(36)

Shannon dierential entropy

Fora ontinuous stohasti variable

X

^with ^density

f (x)

¹ ^the ^dierential^entropy

h(X)

^is

dened as [25℄ [3℄,

h(X) = − Z

f(x) log f (x)dx.

^(4.7)

This an alsobewritten asan expeted value

h(X) = E f {− log f(x) } .

^(4.8)

Properties of Shannon's dierential entropy

1. If

X

îs ^limited ^to â êrtain ^volume

v

ⁱⁿ ^spae, ^then

h(X)

îs ^maximum ând êqual ^to

log v

^when

f (x)

^is ^onstant ^(uniform^density ^funtion)ⁱⁿ ^the ^volume.

2. Dierential entropy may be negative. If we onsider the uniform density above, for

v < 1

^,

log v < 0

^.

3. The normal distributionmaximizes the entropy over alldistributions with the same

ovariane. This property an be exploited to measure the non-Gaussianity of a

stohasti variable.

4. The dierentialentropy is ameasure that is relativetothe oordinatesystem. Con-

sider for example hanging oordinates by a linear transformation,

Y = MX

^. ^In

that ase,

h(Y) = h(X) + log | det(M) | .

Renyi's entropy

Asexplained above,Shannon'sentropy isameasure ofthe average amountofinformation

ontained ina single observation of a random variable. Renyi used a more generaltheory

of mean values, where the mean of the real numbers,

x ₁ , . . . , x _N

^, ^with ^positive ^weighting

p 1 , . . . , p N

^, ^has ^the ^form^[18℄ ^[22℄

x = ϕ ⁻¹ X N

i=1

p i ϕ(x i ),

^(4.9)

where

ϕ(x)

^is^aKolmogorov-Nagumofuntion,whihisanarbitraryontinuousandstritly monotonifuntiondened onthereal numbers. Hefound thatageneralentropy measure

H

^obeys ^the ^relation^[18℄

H = ϕ ⁻¹ X N

i=1

p i ϕ(I(p i ))

!

,

^(4.10)

1

(37)

where

I(p i )

^is ^Hartley's information measure. In order to be an information measure

ϕ( · )

^annot ^be ^arbitrary^, ^sine information is additive. We have two hoies,

ϕ(x) = x

or

ϕ(x) = 2 ^(1−α)x

^. ^The ^rst âse ^gives ^Shannon's êntropy ând ^the ^seond ^gives ^Renyi's

entropy of order

α

^[18℄

H _R _α (X) = 1 1 − α log

X N k=1

p ^α _k

!

α > 0, α 6 = 1.

^(4.11)

There is a well known relation between Shannon's and Renyi's entropy. Let

H S

^denote

Shannon's entropy, then [18℄

H _R _α ≥ H _s ≥ H _R _β

^if

0 < α < 1

^and

β > 1,

α→1 lim H R α = H S .

Renyi's and Shannon's entropies an also be related to eah other in another way. If we

onsiderthe probabilitymass funtion

P = (p 1 , p 2 , . . . , p N )

^as^a^pointⁱⁿ ^theN-dimensional spae, this point will always be in the rst quadrant of a N-dimensional hyperplane with

eahaxis interseting the oordinateone. The distane of the point

P

^to^the ^origin^is ^the

α

^root ^of

d α = X N k=1

p ^α _k = || P || ^α

and the

α

^root ^of

d α

^is ^alled ^the

α

^-norm ^of ^the probability mass funtion [18℄. Renyi's entropy satises all of Shannon's riteria in 4.3.1 on page 23. Exept of the fourth prop-

erty. The Renyi's entropy of order

α = 2

^, îs ^denoted ^by ^Renyi's ^quadrati êntropy ând

orresponds to the 2-normof the probabilitymass funtion.

If werepeat the exampleon page24with Renyi'squadrati entropy measure, we get a

similar shape inFig. 4.4, as inShannon's entropy in Fig. 4.3onpage 25.

(38)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.2 0.4 0.6 0.8 1

p

H R 2 (p)

Figure 4.4:

H(X) = H(p)

ⁱⁿ ^bits ^with ^Renyi's ^quadrati ^entropy ^measure, ^notie ^that

H(X) = 1

^when

p = ¹ ₂

^.

(39)

Renyi's dierential entropy

For the ontinuous random variable

X

^with ^pdf

f (x)

^we ^obtain ^the ^dierential ^version ^of

Renyi's entropy [18℄

h _R _α (X) = 1 1 − α log

Z

f ^α (x)dx,

^(4.12)

= 1

1 − α log E _f { f ^1−α (x) } .

^(4.13)

Ifweset

α = 2

^, ^we ^get ^the ^dierential^Renyi ^quadrati ^entropy

h R 2 (X) = − log

Z

f ² (x)dx,

^(4.14)

= − log E f { f(x) } .

^(4.15)

Some properties of the Renyi entropy of order

α

^are ^the ^following^[9℄

1. Just asfor Shannonentropy, theRenyi entropyis maximizedfora uniformdistribu-

tion for randomvariables with nite support.

2. The Renyi entropy is not in general maximized by the Gaussian distribution in the

xed varianease.

3. The Renyi entropy is invariantto rotationsand translations.

4.3.2 Divergene

This setionreviews some of the mostommonmeasures ofdivergene orrelative entropy

used in information theoreti learning. Divergene is used as a measure of statistial

similarity, and one an think of it as a generalization of algebrai distane measures to

probability spaes [5℄.

Kullbak-Leibler divergene

This measure disriminates two probability density distributions

p(x)

^and

q(x)

^, ând ît îs

alsoreferred to asrelative entropy

D _KL { p, q } = Z

p(x) log p(x) q(x) dx,

= E p

log p(x) q(x)

.

^(4.16)

Some properties of this measure are [9℄

Information theoretic learning for pattern classification

(i, j)

i

j

•

•

•

•

W σ 2 ( · , · )

σ 2 K h ( · , · )

h F

d

C

p(x)

P (x)

−∞

∞

x

d

f

E f {·}

•

•

•

•

•

•

•

•

•

•

g : R d −→ {− 1, 1 } ,

p(x, y ), (x 1 , y 1 ), . . . , (x N , y N ) ∈ R

× Y, Y {− 1, 1 }

g

x

x

+1

g(x) ≥ 0

x

p(x, y)

x

C

ω 1 , . . . , ω C

x

P (ω i | x), i = 1, . . . , C

x

ω ∗

ω ∗ = max

ω i

P (ω i | x), i = 1, . . . , C.

P (ω i ) i = 1, . . . , C

P (ω i ) = n N i

N

n i

ω i

p(x | ω i )

ω i

x

P (ω i | x) = p(x | ω i )P (ω i )

p(x) , i = 1, . . . , C,

p(x) = X

i

p(x | ω i )P (ω i ).

p(x)

ω ∗ = max

ω i

p(x | ω i )P (ω i ), i = 1, . . . , C,

P (ω i )

ω ∗ = max

ω i

p(x | ω i ), i = 1, . . . , C,

x

x

x

ω 1

p(x | ω 1 )P (ω 1 ) > p(x | ω 2 )P (ω 2 ).

x

ω 1

g(x) = p(x | ω 1 )P (ω 1 ) − p(x | ω 2 )P (ω 2 ) ≥ 0

W _σ ² ( · , · )

σ ² K h ( · , · )

g : R ^d −→ {− 1, 1 } ,

ω ^∗

ω ^∗ = max

P (ω i ) = ⁿ _N ⁱ

n _i

ω _i

ω ^∗ = max

P (ω _i )

ω ^∗ = max

p(x, ω ₁ ) = p(x | ω ₁ )P (ω ₁ )

R ₁

P ( x| ω _i )

p(x | ω ₁ )

R ₂

p(x | ω ₂ )

R ₁

R ₁

R ₂

P (x ∈ R _i , ω _i ),

R _i

ω _i