• No results found

Information theoretic learning for pattern classification

N/A
N/A
Protected

Academic year: 2022

Share "Information theoretic learning for pattern classification"

Copied!
121
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

Master's Thesis in Eletrial Engineering

Information Theoreti Learning for

Pattern Classiation

by

Ola Kvisle Storås

Deember, 2007

FACULTY OF SCIENCE

Department of Physis and Tehnology

University of Tromsø

(2)
(3)

Thisthesisisastudyofpatternlassiationbasedoninformationtheoretiriteria. Infor-

mationtheoreti riteriaareimportantmeasures basedonentropy anddivergene between

data distributions. First, the basi onepts of pattern lassiation with the well known

Bayes lassiation rule as a startingpointis disussed. We disuss how the Parzen win-

dow estimator may be used to nd good density estimates. The Parzen window density

estimator an be used to estimate ost funtions based on information theoreti riteria.

Furthermore,we explain amodelof aninformationtheoreti learningmahine. Withost

funtionsbased oninformationtheoreti riteria,weargue thatalearningmahinepoten-

tiallyapturesmuhmore informationaboutadata setthan thetraditionalmeansquared

error ost (MSE) funtion. We nd that there is a geometri link between information

theoreti ost funtions estimated using Parzen windowing, and mean vetors in a Mer-

er kernel feature spae. This link is used to propose and implement dierent lassiers

based on the integrated squared error (ISE) divergene measure, operating impliitly in

a Merer kernel feature spae. We also apply spetral methods to implement the same

ISE lassiers working inapproximations of Merer kernel feature spaes. We investigate

the performane of the lassiers when we weight eah data pointwith the the inverse of

the probability density funtion at that point. We nd that the ISE lassiers working

impliitly in the Merer kernel feature spae performs similar to a Parzen window based

Bayes lassier. Usinga weighted inner-produt denition gives slightly better results for

some data sets, while for other data sets the lassiation rates are slightly worse. When

omparingtheresults between theimpliitISElassierusing unweighted datapointsand

theParzenwindowBayeslassier,someoftheresultsindiatethattheISElassierfavor

the lasses with highest entropy.

(4)
(5)

I would like to thank my supervisor in this thesis, Robert Jenssen, for introduing and

explainingmanynewfasinatingideasinpatternanalysisduringtheworkwiththisthesis.

(6)
(7)

1 Introdution 1

1.0.1 Quik summaryof ontent inthis thesis. . . 3

1.1 Denitionsand notation used . . . 4

1.2 Struture and literature . . . 4

I Theory 7 2 Pattern lassiation basis 9 2.1 The Bayes lassier . . . 9

2.2 Minimum probabilityof lassiation error for the Bayes lassier . . . 10

3 Density estimation 13 3.1 Parametri methods . . . 13

3.2 Non-parametrimethods . . . 14

3.2.1 The histogramdensity estimator . . . 14

3.2.2 The naive density estimator . . . 14

3.2.3 The Parzen windowdensity estimator. . . 15

3.2.4 The multivariateParzen window . . . 16

4 Information theoreti learning priniples 19 4.1 Informationtheory . . . 19

4.2 Aninformation theoreti learningmahine . . . 20

4.3 Informationtheoreti quantities . . . 22

4.3.1 Entropy . . . 22

4.3.2 Divergene . . . 29

4.4 Estimationof informationtheoreti ost funtions . . . 32

4.4.1 Renyi quadratientropy estimate . . . 33

4.4.2 ISE divergene estimate . . . 34

4.4.3 Cauhy-Shwarz divergene estimate . . . 34

4.4.4 Usingnon-Gaussian kernels in estimation of ost funtions . . . 35

4.4.5 Shannon entropy estimate . . . 35

(8)

5.2 Informationmeasures in the Merer kernel spae . . . 38

5.2.1 ISE divergene . . . 39

5.2.2 Cauhy-Shwarz divergene . . . 40

5.3 The standard ISE lassier . . . 42

5.3.1 Connetionsto Parzen windowBayes lassier . . . 43

5.3.2 Multilassstandard ISE lassier . . . 44

5.3.3 UsingNon-Merer kernels . . . 45

5.3.4 Connetionto the graph ut . . . 45

6 A Laplaian ISE lassier 47 6.1 Modied ISE divergene . . . 48

6.2 Connetionto the Bayes probability of error . . . 48

6.3 Kernelspae and Laplaian matrix representation . . . 49

6.3.1 Illustrationof weights . . . 52

7 Spetral ISE lassiers 55 7.1 Mappingof data to aMerer based feature spae . . . 56

7.1.1 Illustrationof mappings to Merer spae . . . 57

7.2 Spetral versions of the ISE lassiers . . . 59

7.2.1 The spetral ISE lassier . . . 59

7.2.2 The spetral LaplaianISE lassier . . . 59

II Analysis and experiments 61 8 Kernel seletion 63 8.1 Eet of kernel bandwidth and kernel type . . . 63

9 Classiation experiments 69 9.1 Introdution . . . 69

9.2 Seletion of data sets and lassiation methods . . . 69

9.3 Standard ISE . . . 70

9.3.1 Confusionmatries Standard ISE and Bayes . . . 72

9.3.2 Summaryof lassiation results for the standard ISE lassier . . 76

9.4 LaplaianISE . . . 77

9.4.1 Summaryof lassiation results for the LaplaianISE lassier . . 84

9.5 Spetral ISE . . . 84

9.5.1 Summaryof lassiation results for the spetral ISE lassier . . . 89

9.6 Spetral LaplaianISE . . . 90

9.6.1 SummaryoflassiationresultsforthespetralLaplaianISElas-

(9)

10 Conlusion 97

10.1 Furtherwork . . . 97

Appendix 99

A Laplaian ISE lassier 99

A.1 Averageonfusion matriesLaplaian ISE lassier . . . 99

B Spetral ISE lassier 101

B.1 Averageonfusion matriesSpetral ISE . . . 102

C Spetral Laplaian ISE lassier 105

C.1 Averageonfusion matriesfor the spetralLaplaian ISElassier. . . 106

Bibliography 109

(10)
(11)

Introdution

Patterns are used to desribe any relations, regularities or struture inherent in a data

set generated by a soure. Similar patterns are often grouped into a lass. By deteting

signiant patterns in the available data, a learning mahine an make preditions about

new data oming fromthe same soure [26℄. In patternlassiation we use some labeled

trainingdataset, whereeahlabelrepresent alass,and usealearningmahinetopredit

the orret lass label for a new unlabeled sample. A learning mahine may be viewed

as a devie that adjust a set of parameters through a learning proess. For eah new set

of training data given to the mahine, the parameters are updated using a riteria that

aptures thewanted informationtodesribethedata inanewform. Patternlassiation

isone ofthe fundamentalproblemsinmahine learningand signalproessing[29℄, [26℄,[4℄.

Itisanimportantpartinomputervision,medialimaging,optialharater reognition,

geostatistis, handwriting reognition, biometriidentiation, natural language proess-

ing,doumentlassiation,emailspamdetetionandreditsoring,tolistafewexamples

[29℄, [4℄, [23℄,[1℄, [15℄, [8℄.

Information theoreti learning (ITL) methods emerged in [18℄ and [5℄. Information

theoreti learning here refers to the use of a general learning mahine as we desribe

in Setion 4.2, where a riteria related to Renyi's quadrati entropy is used to update

the learning proess. The riteria often use Renyi's quadrati entropy to nd divergene

measures between data in dierent distributions. A divergene measure an be thought

of as a generalization of algebrai distane measures (suh as the Eulidean norm) to

probability distributionspaes [5℄. In [18℄ two important informationtheoreti divergene

measures were presented, one is based on the Cauhy-Shwarz (CS) inequality and the

other onthe integrated squared error (ISE)between two probability distributions.

The Renyi's quadrati entropy is estimated using a Parzen window density estimation

method, and may be thought of as a generalization of variane to proesses with non-

Gaussiandistributions[5℄. In[18℄and [5℄informationtheoreti oneptsare explainedand

used in time series predition, independent omponent analysis (ICA), feature extration

and blind soure deonvolution.

Independent of ITL, a number of kernel methods have emerged in the reent years.

Kernel methods generally solve mahine learning problems in two parts: A module, also

(12)

known as a kernel funtion, performs a mapping of data to a new feature spae. In this

newfeaturespaealearningalgorithmisusedtodisoverlinearpatterns [26℄. Theuse ofa

kernelfuntionallows ustooperateimpliitlyinapossiblehighdimensionalspaethrough

evaluations of inner-produts. This is also known as the kernel trik. The advantage

of operating in high dimensional spaes is that the probability that the data is linearly

separable inreaseswiththe numberofdimensions weoperatein[2℄. Kernelmethodshave

been used suessfully in various elds of mahine learning, and inlude algorithms suh

assupportvetormahines(SVMs), kernelFisherdisriminantanalysis (KFD)and kernel

prinipalomponent analysis (KPCA) [17℄.

The anity matrixof adata set is alulatedwith the pairwiseinner-produts ofdata

samples with a kernel funtion, Element

(i, j)

of this matrix ontains the inner-produt between data sample

i

and

j

, omputedwith akernel funtion. The inner-produtsof the anity matrix may be weighted suh that eah data point is multipliedwith the inverse

overallprobabilitydensityfuntionatthatpoint. Thisanitymatrixwithweightedinner-

produts is referred to as the Laplaian matrix. Spetral methods based on the spetral

properties, i.e. the eigenvetors and eigenvalues of the anity matrix or the Laplaian

matrix, havebeen popular inreent lustering appliations [5℄,[9℄.

It has been shown [13℄ and [9℄, that when using the non-parametri Parzen window

density estimatorwitha Mererkernel funtiontoestimatethe Renyi'squadrati entropy

for adata set, the result may be interpreted in termsof a mean vetorin a Merer kernel

feature spae. By measuring distanes between lass mean vetors in a Merer kernel

feature spae we an reate dierent information theoreti divergene measures between

lass distributions. TheCS divergene measure isshownin[13℄toberelatedtomeasuring

angles between lass mean vetors in a Merer kernel feature spae and have interesting

onnetionstograph theory. The ISEdivergene measure isina similarmannershown to

be relatedto the Eulidean distane between mean vetors ina Merer kernel spae [9℄.

The link between ITL and Merer kernel methods has been used to develop reent

lassier [12℄, [9℄, and lustering [9℄, [11℄, [10℄ algorithms based on the CS divergene

measure. The lassier proposed in [12℄ works impliitlyina Merer kernel feature spae,

whiletheCSbasedspetrallusteringalgorithmsworkinapproximateMererkernelspaes

spanned by the prinipal eigenvetors of the Laplaian matrix. These methods have all

usedweightedinner-produtkernelfuntionswhenalulatingtheCSdivergenemeasure.

This thesis is inspired by the use of the CS divergene measure in both lassiation

and lustering appliations. We provide neessary bakground on information theoreti

learning onepts to understand newly disovered, important relations between Merer

kernel theory, information theoreti measures and density estimation. In partiular we

fousonthepropertiesoftheISEinformationtheoretidivergenemeasureanduseMerer

kernel properties and geometri properties of this measure to argue that it may be used

as a ost funtion in an information theoreti lassier. Previous information theoreti

lassiers havetoour knowledgeonlybeenimplemented usingthe weighted CS divergene

measure. Thus we aim to use the ISE divergene measure in a similar manner. We

investigate performane and properties of lassiers based on the ISE divergene using

(13)

through evaluations with Merer kernels.

WealsoinvestigatespetralversionsoftheISElassiers,whereweeigendeomposethe

Laplaiandata matrix or the anity matrix. None of the ISE divergene based lassiers

presented inthis thesis havebeen presented before, sowe ouldnot know howthey would

perform. Wehoose toompare theresultsobtainedwithourimplementationoftheBayes

lassier, both beause it is a well known lassier and beause we nd out that the ISE

based lassierrule may beexpressed in avery similarway tothe Bayes lassierrule.

We present two dierent versions of the ISE lassier operating impliitly in dierent

Mererspaes. ThestandardISEandtheLaplaianISElassier,operatingonunweighted

and weighted trainingdata, respetively. We alsodevelop spetral versions of these las-

siers, the spetral ISE lassier and the spetral Laplaian ISE lassier, working in

the approximated Merer spaes spanned by the prinipal eigenvetors of the anity and

Laplaianmatries, respetively. The ISE based lassiers working impliitly ina Merer

kernelspaeingeneralseemstogivebestresults. ForsomedatasetstheLaplaianindued

weights improve the lassiation rates, but in others it redues or does not hange the

lassiation rates signiantly.

1.0.1 Quik summary of ontent in this thesis.

Weprovideneessarybakgroundinformationabouttherelativelynewoneptsused ininformationtheoreti learning. We alsoreview bakground informationneessary

tounderstand basi density estimation and patternlassiation.

We introdueand investigatenew lassiers based on the informationtheoreti ISE divergene measure and kernel methods, using both weighted and unweighted data.

We investigate relations between an ISE divergene based lassier operating im- pliitly in a Merer kernel spae and the well known Parzen window based Bayes

lassier. Wendthat usingunweighted data theISE lassierisomparabletothe

Bayes lassierwith slightly dierent properties.

We use the spetral properties of the anity and Laplaianmatries of the data, to

proposeandinvestigateISEbasedlassiersworkingdiretlyinapproximatedMerer

kernel spaes. We note that in most ases the spetral versions of the ISE lassier

perform slightly worse than the versions working impliitly inMerer spaes.

(14)

1.1 Denitions and notation used

x,y A training patternand lass label

i,N Counter and number of patterns

i.i.d Independent identiallydistributed

pdf Probability density funtion

ISE Integrated Squared error

MISE Mean Integrated Squared Error

AMISE Asymptoti Mean Integrated Squared Error

IP Information Potential

ITL Information Theoreti Learning

W σ 2 ( · , · )

A Gaussian kernel, with bandwidth

σ 2 K h ( · , · )

A general Merer kernel, with bandwidth

h F

A Merer kernel feature spae

d

Dimensionality of data

C

Numberof lasses

Table 1.1: Denitions and abbreviations

Density funtions are usually referened with small letters e.g.

p(x)

. Probabilities are referened with large letters e.g.

P (x)

. Integrals with no limits are assumed to be with

lowerlimit

−∞

andupperlimit

. Thesample

x

maytakeany valueinthe

d

dimensional spae, unless otherwise speied. Expetations with regard to a variable or funtion

f

is

denoted by

E f {·}

if it may beunlear what we alulatethe expeted value with respet

to.

1.2 Struture and literature

Struture of this thesis This thesisis divided inthree parts. In Part I wepresent the

theory neessary to understand and implement our lassiers. Part II ontains analysis

and experiments done to hek how the theory works onsome populardata sets. In Part

III we onlude the thesis and suggests further work that may be done.

Chapter 2introdue thebasi onepts ofpatternlassiation. TheBayeslassier is used as an example and shown to give a minimum probability of lassiation

error.

Chapter 3disusses variousmethodsfor density estimationwith anemphasis onthe

Parzen windowdensity estimatorand itsproperties.

Chapter 4 explains the onept of an information theoreti learning mahine. A detailed disussion of various information theoreti riteria is given. Some sample

(15)

Chapter 5 explain the kernel trik and how it an be used to express information theoreti measures inaMererkernel spaefromasimple geometriviewpoint. The

standard ISE lassieris developed using this geometri view of the ISE divergene

between two distributions. This lassier is shown in a speial ase to be equal to

the familiarBayeslassierusing density estimates. Somerelationsbetween theISE

divergene measure and graph theory are disussed.

Chapter 6 presents the Laplaian ISE lassier, a modied version of the standard

ISE lassierusing weighted data samples.

Chapter 7 presents spetral versions of the standard ISE and Laplaian lassier,

working in approximated Merer kernel spaes.

Chapter 8 provides a short analysis of the eet of dierent kernel sizes and kernel

typesused in the previously presented lassiers.

Chapter 9presents and disusses resultsfound using thedierentlassiers onsome

popular datasets. Wetry toillustratethe eets of weighting the data samplesand

howdata distributes inthe approximate Merer kernel spaes.

Chapter 10 onludes this thesis and suggests some work that may be done in the

future.

The appendix ontains some additionallassiationresults not listedin Chapter 9.

Literature Information theory in general is overed by [3℄ and the artile whih rst

used entropy as a measure in ommuniation, [25℄. The priniples behind information

theoretilearningwasmainlyintroduedin[18℄ withannieoverview in[5℄and[9℄. Good

introdutionbookstopatternlassiationmethodsare[29℄,[4℄and[8℄. TheCSdivergene

based lassier whih inspiredmuh of the work with our ISE divergene based lassiers

is presented in [12℄. Important relationsbetween Merer kernel funtions,Parzen window

density estimators, CS divergene and graphtheory isreviewed in[13℄. Asurvey of kernel

methods used in pattern analysis is given in [26℄ and [17℄. Spetral methodsused in this

thesis are inuened by materialin[26℄ [9℄, [24℄, [6℄and [31℄.

(16)
(17)

Theory

(18)
(19)

Pattern lassiation basis

This hapter gives a denition of pattern lassiation. It also reviews the well known

Bayes lassiation rule, whih is shown to give a minimum probability of lassiation

error. TheBayeslassiermaythus beused asabenhmark whentesting otherlassiers

withregardtoerrorrates. InPartII,Chapter 9,weuseanimplementationofthislassier

toompare withthe new lassiers whih willbepresented inlater haptersof thisthesis.

Denition The task of lassiation is to nd a rule, whih based on observations of

trainingpatterns assigns anunlassied pattern toone of several possible lasses. A las-

siationrule with two dierent lasses isto estimatea funtion

g : R d −→ {− 1, 1 } ,

usinginput-outputtrainingdatapairsgeneratedi.i.daordingtoanunknown probability

distribution

p(x, y ), (x 1 , y 1 ), . . . , (x N , y N ) ∈ R

d

× Y, Y {− 1, 1 }

suhthat

g

willorretlylassifya new sample

x

[17℄. A sample

x

isassignedtothe lass

labeled

+1

if

g(x) ≥ 0

. Thesample

x

isassumed tobegeneratedfromthesamepdf

p(x, y)

as the trainingdata.

2.1 The Bayes lassier

To explainmore indetail howwean dene alassierwe begin withthe Bayeslassier.

The reason for starting with this is that it an be shown to give an optimal result with

regard to minimum probability of lassiation error, under ertain onditions. This will

be proved in the next setion. It isalso well known and has some theoretial onnetions

tothe ISE lassier, whih willbeexplained inSetion 5.3.1.

Wewanttolassifyanunknownfeaturevetor

x

tooneof

C

possiblelasses

ω 1 , . . . , ω C

in a way that assigns

x

to the lass where it'smost likely to belong. We dene what is

(20)

most likely with the probabilities

P (ω i | x), i = 1, . . . , C

, also known as the a posteriori

probabilities. A possible lassiation rule isto assign

x

to the lass

ω

satisfying[29℄

ω = max

ω i

P (ω i | x), i = 1, . . . , C.

(2.1)

Assume that the a priori probabilities

P (ω i ) i = 1, . . . , C

, are known. If they are

unknown they anbeestimated fromour trainingdata as

P (ω i ) = n N i

where

N

isthe total

number of training samples, and

n i

is the number of samples belonging to lass

ω i

. The

lass-onditional probability density funtions

p(x | ω i )

, also known as likelihood funtions

of

ω i

, with respet to

x

, are alsoassumed tobe known [29℄. Ifthese are unknown we will

see later how they an be estimated.

1

To alulate the a posteriori probabilities we an

use Bayesrule [29℄

P (ω i | x) = p(x | ω i )P (ω i )

p(x) , i = 1, . . . , C,

(2.2)

where

p(x) = X

i

p(x | ω i )P (ω i ).

Sine

p(x)

isa ommonfator forall lasses itmay be ignored and wean use

ω = max

ω i

p(x | ω i )P (ω i ), i = 1, . . . , C,

(2.3)

as our lassier. If the apriori probabilities

P (ω i )

, are equal Eq.(2.3) redues to

ω = max

ω i

p(x | ω i ), i = 1, . . . , C,

andthesearhforthemostprobablelassforfeature

x

reduestoevaluatingtheonditional

pdfs at

x

. Sine the lassier inEq. (2.3) is obtained using Bayes rule, itis oftenreferred

toas the Bayeslassier. Inthe two lass ase we use Eq. (2.3) to assign

x

to lass

ω 1

if

p(x | ω 1 )P (ω 1 ) > p(x | ω 2 )P (ω 2 ).

Our lassiation funtionan nowbe dened asto lassify

x

tolass

ω 1

if

g(x) = p(x | ω 1 )P (ω 1 ) − p(x | ω 2 )P (ω 2 ) ≥ 0

2.2 Minimum probability of lassiation error for the

Bayes lassier

InFig.2.1,theBayesianlassierfortwoequiprobablelassesforaone-dimensionalfeature

vetor

x

is illustrated. The region to the left of the dotted threshold linelearly ontains most of

p(x, ω 1 ) = p(x | ω 1 )P (ω 1 )

and we dene this as

R 1

, and the region to the right

1

If

x

isdisretethelikelihood funtionsbeomeprobabilitiesandaredenoted with

P ( x| ω i )

(21)

Figure 2.1: Illustration of two deision regions. Figure borrowed from [29℄

of the threshold as

R 2

. Let

R 1

and

R 2

be the regions where we lassify

x

to

ω 1

and

ω 2

,

respetively. The total area overed by

p(x | ω 1 )

inregion

R 2

and

p(x | ω 2 )

in

R 1

willbethe

probabilityofausinglassiationerrors. Ifthethresholdismovedtotheleftorright,this

areaand probabilitywillinrease. Thismeans thatif wewant tominimizethe probability

of an error, the deision regions

R 1

and

R 2

must be seleted by moving the threshold so

this area is assmallas possible.

In a multilass situationwith a multidimensionalfeature vetor

x

we have

C

dierent

lasses with deision regions

(R 1 , . . . R C )

our feature vetor

x

an be plaed in. We now

generalize the situation in Fig. 2.1. Writing the probability of a orret deision by the

joint probability

P (x ∈ R i , ω i ),

then the probability of erroneously assigning

x

to

ω j

by not seletingthe orret lass

ω i

is

P (x ∈ R j , ω i ), ∀ j 6 = i.

(22)

The total probability for ommittingan error inlassiation isthus

P e = [

∀j6=i

P (x ∈ R j , ω i ),

= X

∀j6=i

P (x ∈ R j | ω i )P (ω i ),

= X

∀j6=i

P (ω i ) Z

R j

p(x | ω i )dx,

= X

∀j6=i

Z

R j

P (ω i | x)p(x)dx.

The total probability for orretlassiation is

X

i

Z

R i

P (ω i | x)p(x)dx = 1 − P e .

Hene

P e = 1 − X

∀i

Z

R i

P (ω i | x)p(x)dx.

The error is learly minimizedwhen the regions

R i

are seleted in away where

R i : P (ω i | x) > P (ω j | x), ∀ j 6 = i,

whihis the same as Eq. (2.1).

(23)

Density estimation

Theostfuntionsusedbythelassiersdisussedinthisthesisarealldependentonnding

somesortofdensityestimateforthelassdistributionsofthedata. Theasewhereweknow

the distributions of the feature vetors in eah lass

ω i

, given by the likelihood funtions

p(x | ω i )

, is unfortunately not the reality for most data sets. We have to nd estimates of these distributions. There are basially two ategories of methods for estimation of

pdfs, parametri and non-parametri methods. In this setion, for ompleteness, a short

desription of parametri methods for density estimation is given. For more details [29℄

is reommended. The non-parametri methods are far more important in information

theoreti lassiation, sine we often annot assume that the data set has a parametri

distributionshape. Inpartiular theParzenwindowmethodfor density estimationwillbe

investigated. Throughoutthishapterweassumeadatasetof

N

samples,

x i , i = 1, . . . , N

,

generated i.i.dfrom unknown distributions,unless otherwise is speied.

3.1 Parametri methods

Assume that the pdf to be estimated is desribed in parametri form by some unknown

parameter vetor

θ

, so it an be written as

f (x; θ)

. We have a limited number

N

of

i.i.dtrainingdata,

x 1 , . . . , x N

availablefromour distribution. Usingthese sampleswean use dierent methods to nd an estimate of the parameters in

θ

suh that the estimate

f ˆ (x; θ)

is as lose as possible to the true pdf. This means that we assume that our data

is generated from a distribution with a shape that is lose to a parametri form, e.g we

assume a Gaussian, Rayleigh or some other well known distributionand try to adjust its

parameters to t our data as good as possible. The parameters in

θ

are usually found

by maximum likelihood estimation. In [29℄ some methods for parametri estimation are

desribed, e.g. maximum likelihoodestimation, mixture models, maximum entropy et.

(24)

3.2 Non-parametri methods

To avoid the need to make assumptions about a parametri shape of the desired distri-

bution, we must often use non-parametri methods. In this setion, we review dierent

methodsofdensityestimationwithaone-dimensionalrandomvariable

x

takenfromaon-

tinuous,univariatedensity funtion

f(x)

. Westartwiththe simplehistogrammethodand

expandituntilweendup withthe Parzenwindowestimator. Someof themostimportant

properties of the Parzen window estimator are then disussed. In the last setion, the

Parzen windowmethodis expanded to the ase where we have multivariate distributions,

where the variable

x

is amultidimensionalvetor.

3.2.1 The histogram density estimator

The oldest way to nd a non-parametri estimate of a funtion is given by the histogram

[30℄, [9℄. Given an origin

x 0

, and a bin width

h

, the bins of the histogram are dened as

[x 0 + mh, x 0 + (m + 1)h)

for positive and negative integers

m

. The histogramestimateof

the funtion

f (x)

is then

f ˆ (x) = 1

Nh (

noof

x i

insame bin as

x).

(3.1)

Thisestimatorisobviouslydisontinuousand notusablewhen weneedtondderivatives.

3.2.2 The naive density estimator

This is alsoa variantof the histogrammethod. Dene the pdf evaluatedat

x

as

f (x) = lim

h→0

1

2h P (x − h < X < x + h)

(3.2)

The probability

P (x − h < X < x + h)

an be estimated by ounting the numberof data

samplesfallingintoabinof size

2h

entered at

x

. This anbedened morepreiselywith

a weight funtion

W (x) = 1

2

if

| x | ≤ 1 0

otherwise

,

suh that the naive estimator an beexpressed as [9℄

f ˆ (x) = 1 Nh

X N i=1

W

x − x i

h

.

(3.3)

Introduinga resalingnotation

W h (u) = h −1 W (u/h)

werewrite Eq. (3.3) as

f(x) = ˆ 1

N X N

i=1

W h (x − x i ).

(3.4)

From Eq.(3.3) wesee that anestimateforthe pdf at

x

is given by plaingabox around

eah sample

x i

with width

2h

and height

(2Nh) −1

and sum up. This estimator is not

(25)

3.2.3 The Parzen window density estimator

Parzen generalized the weight funtion

W h ( · )

in Eq. (3.4) toa kernel funtion or Parzen

window whihis a funtionsatisfying [29℄

K h (x) ≥ 0

and

Z

x

K h (x)dx = 1.

Thesubsript

h

refers tothebandwidth orwindowwidth of thekernel[30℄. Usually

K( · )

is

hosen tobe aunimodalprobability density funtion that issymmetriaroundzero. This

makes sure that the estimator

f(x) = ˆ 1 N

X N i=1

K h (x − x i )

(3.5)

produes anestimatewhih isalsoa density.

Toinvestigatesomeofitsproperties,wendexpressionsforthemeanvalueandvariane

of Eq. (3.5). Let

f(x) ˆ

be the estimate of the true density

f (x)

at

x

, with

x

a random

variable with density

f (x)

. Then

E { f ˆ (x) } =E { K h (x − x ) }

= Z

K h (x − x )f(x )dx

=K h (x) ⋆ f (x).

(3.6)

The density estimate is therefore a smoothed version of the true density. The bias of the

estimatoris given by [30℄

E { f ˆ (x) } − f (x) = [K h (x) ⋆ f (x)] − f (x),

(3.7)

and the variane is [30℄

V ar { f ˆ (x) } =E { [ ˆ f(x) − E { f ˆ (x) } ] 2 }

= 1

N { (K h 2 (x) ⋆ f (x)) − [(K h (x) ⋆ f (x)] 2 } .

(3.8)

It is ommonto measure the loseness of the estimator

f ˆ (x)

tothe target density

f(x)

in

the point

x

by the size of the mean squared error (MSE)

MSE { f(x) ˆ } = E n

[ ˆ f (x) − f(x)] 2 o ,

whihan bewritten as

MSE { f ˆ (x) } = 1

N { (K h (x) 2 ⋆ f (x)) − [(K h (x) ⋆ f (x))] 2 } + { (K h (x) ⋆ f (x)) − f(x) } 2

=V ar { f(x) ˆ } + [Bias { f ˆ (x) } ] 2 .

(26)

Instead of just estimating the funtion

f (x)

at a single point we want to estimate it over

the whole

x

-spae. The mean integrated error (MISE) is a more appropriate measure for analyzing

f ˆ (x)

where

MISE { f(x) ˆ } = Z

MSE { f ˆ (x) } dx

= Z

E { [ ˆ f(x) − f (x)] 2 } dx + Z

V ar { f ˆ (x) } dx.

(3.9)

The biasand varianeterm inEq.(3.9)depend onthe kernel width

h

indierentways. It

has been shown that Eq. (3.9) for large sample sizes

N

, the asymptoti mean integrated

squared error (AMISE) isgiven by [14℄,[30℄

AMISE { f(x) ˆ } = (Nh) −1 R(K ) + 1

4 h 4 µ 2 (K) 2 R f ′′

(3.10)

where

µ 2 (K) = R

z 2 K(z)dz

,

R(f ′′ ) = R

{ f ′′ (x) } 2 dx

,

f ′′ (x) = d d 2 2 x f (x)

and

R(K) = R K(z) 2 dz

. Weseethatminimizingtheleftterm(thevariane)withalargekernel window

h

results in a huge inrease in the bias part whih is proportional to

h 4

. This is what is

known as the variane-bias trade-o in kernel size seletion. There exists many ways to

nd the kernel size

h

. We mention two popular methods here. The rst dierentiates Eq. (3.10) and equates itto zero, obtaining

h AM ISE =

R(K) µ 2 2 (K)R(f ′′ )N

1 5

The other method estimates

R(f ′′ )

by assuming that the true underlying density is a

normal density. Then the kernel size is given by [9℄

h AM ISE = 1.06N 1 5

Several other methodsexist, see [14℄ and [30℄.

3.2.4 The multivariate Parzen window

TheextensionoftheParzenwindowtofeaturedataina

d

-dimensionalspaeisalittlemore diult. The sparseness of data in higher dimensional spaes makes the estimation more

diult, unless we have very many samples. This phenomenon is usually referred to as

the urse of dimensionality. Rememberingthat the kernel funtioninthe one-dimensional

ase speify the window width,this window willinthe multidimensionalase be replaed

with hyperubesand eah dimensionof the ube requiresa parameter tobe estimated for

the kernel. A diret extension of the univariate kernel estimate in Eq. (3.5), is obtained

by replaing the point

x

with a vetor-point

x ∈ R d

and the variable

x i

with a

d

-variate

sample

x i

with density

f(x)

. The Parzen estimatorbeomes [30℄

f(x) = ˆ 1 N

X N i=1

K H (x − x i )

(3.11)

(27)

where

H

is a symmetripositivedenite

d × d

matrix alled the bandwidthmatrix

K H (x) = | H | −1/2 K(H −1/2 x).

With further restritions on

H

, see [30℄ for details, we get the single bandwidth kernel

estimator

f(x) = ˆ 1 Nh −d

X N i=1

K { (x − x i )/h } .

(3.12)

There exists several methodstogivean estimateofthe optimalkernelsize for amulti-

variatedataset. The optimalkernelsizeisusuallyseleted tominimizetheMISE between

f ˆ (x)

and the target density

f (x)

. The normal referene rule for the MISE kernel size is

given by Silverman'srule [28℄, [9℄

ˆ h = σ x

4 (2d + 1)N

d+4 1

,

(3.13)

where

σ x 2 = d −1 P

i Σ ii

and

Σ ii

are the diagonalelementsof the sample ovarianematrix.

Due to the urse of dimensionality this method is not regarded as reliable for higher di-

mensionaldata. In this thesis most of the data sets are of higher dimensions, so we have

hosen a ross validationtehnique to nd the best kernel sizes inthe density estimates.

(28)

Some well-known Parzen windows with

u = x−x h i

are listed below, where for

u ≥ 1

all

windows evaluate to zero, exept the Gaussian kernel. Forsimpliity we only present the

one-dimensional versions,but they an easilybeextended to multivariateversions.

Uniform

K h (u) = 1/2.

Epanehnikov

K h (u) = 3

4 (1 − u 2 ).

Gaussian

W σ 2 (x) = 1

√ 2πσ 2 exp (

− (x − x i ) 22

) .

Quarti

K h (u) = 15

16 (1 − u 2 ) 2 .

Triweight

K h (u) = 35

32 (1 − u 2 ) 3 .

Cosinus

K h (u) = π

4 cos π 4 u

.

(29)

Information theoreti learning priniples

This hapter starts with a brief introdution of information theory. An overview of the

onepts inaninformationtheoreti learningmahineisthen given. Informationtheoreti

riteria,whihgivesusthetoolstomeasuretheshapeof,anddistanebetween,probability

distributions,is then explainedindetail. Inthe lastsetion wedisuss some ost funtion

estimators used in ITL.

4.1 Information theory

Information theory is in this thesis related to C.E Shannon's report from 1948, A math-

ematial theory of ommuniation [25℄. Shannon dened a measure of information or

unertainty assoiated with a stohasti experiment and named it entropy. This measure

was used toanswer importantquestionsin ommuniation. Shannon used entropy tond

alimittohowmuhinformationanbetransferred overanoisyhannel,and tond ways

todesign optimalodes for data ompression.

Entropy an bethoughtof asthe unertaintyassoiated withthe value ofa realization

ofasingle randomvariable. Itisameasure onhowmuhinformationthatisgainedabout

the ontent of astohasti random variable aftera stohasti experiment.

Shannon also dened a measure alled mutual information, whih is the amount of

information that one random variable arries about another, i.e. the redution in the

unertaintyofone randomvariableduetotheknowledgeofthe other. Mutualinformation

is a speial ase of a more general quantity, alled relative entropy. The relative entropy

or divergene an be used as a measure of distane between two distributions. It is a

measure of the ineieny of assuming that a distribution is given by a density funtion

q( · )

,whenitinfathas adistributiongiven byadensityfuntion

p( · )

. Thisisalsorefereed

toas adivergene measure between distributions. In this thesis we use dierent estimates

of entropy and divergene asinformationtheoreti measures.

(30)

Figure 4.1: A general learning mahine.

4.2 An information theoreti learning mahine

A very general learningmahine may be desribed using a modellikethe one in Fig. 4.1.

Mahine learning is in general divided into supervised, semi-supervised and unsupervised

tasks. The model in Fig. 4.1 an be used to desribe all of the above mahine learning

tasks. Wehave some input data

X

,ontaining informationor measurements from areal- world event. We want the learning mahine to perform some spei task on

X

. This is

done by givingthe inputdata

X

to apossibly non-linear parametri mappingfuntion

g : R d → R M ,

(4.1)

whihtransforms the input vetor

X ∈ R d

to

Y ∈ R M

Y = g (X, W ),

(4.2)

where

W

aretheparametersofthemappingfuntion. Iftheoptimalityriterionisbasedon

aninformationtheoretimeasure, eitherentropyordivergene, weallthis aninformation

theoreti learningmahine. The mapperfuntionin Eq. (4.2)transform theinput data to

a new formdepending onthe task of the learningmahine. The outputof the mapper,

Y

isompared with anoptimality riterionand optionallyadesired response

z

for theinput

X

. For eah presentation of training data the optimality riterion is evaluated and the error term

e

isfed to anadaptation algorithmwhih update the parameters

W

.

Supervisedlearningonerns alearningmahinewith adesiredresponse foreahinput

X

. This thesis fous on supervised learning, speially on lassiation of data. If the

learningtask is lassiation,the desiredresponse

z

for the training data ontains a lass

(31)

label. Thetaskisthentolassifyarandominputsampletooneof

C

lasses. Themapping

funtion may inthis ase bedesribed as

g : R d → { y 1 , . . . , y c } ,

where

y 1 , . . . , y c

arethepossiblelasslabels, and

M = 1

inEq.(4.1). Regressionisanother

example of atypialsupervised learningtask, where the mappingis

g : R d → R .

Semi-supervised learningtasksonerns learningwhere thelabelsofthe inputdataare

only partially known. One example is ranking where we only have available the relative

orderingofthetheexamplesinthetrainingset,whileouraimistoenableasimilarordering

of noveldata.

Unsupervised tasksare learningtaskswhere thewanted informationfromthe datahas

tobeextrated withoutany desiredresponse inthe optimalityriterion. Clustering isone

typialexampleofunsupervisedlearning,wheretheaimistondanaturaldivisionofdata

into homogeneous groups [26℄. Anomaly and novelty detetion are other examples, where

the taskistodetetsamplesthatdeviate fromthenormal. Other importantunsupervised

tasks are nding low-dimensional representations of the input data, important examples

ofthis isprinipal omponentanalysis (PCA)and independentomponent analysis (ICA).

In PCA, the mapping inEq. (4.1) aims to projet the input

X

toa lower

M

-dimensional spae, where

M

denotes the number of unorrelated features in

X

. In ICA the goal is to

projet

X

to a lower

M

-dimensional spae where eah of the features of

X

are mutually

independent.

Traditionallythe riteria foroptimalityinFig. 4.1has been tominimizethe MSEost

funtion between the output

Y

of the mapperand the desiredoutput

z J(Y ) = E

(z − Y ) 2 .

(4.3)

Inourgeneralmahinelearningmodelwewanttotransferasmuhinformationaspossible

aboutourdataintothemapperfuntion

g(X, W )

,suhthatthismapperisabletodesribe

our data as aurately as possible. The optimality riterion is thus ritial in obtaining

the parameters

W

. If weuse MSE, the informationtransferred fromthe measurements

X

andthe desiredresponses

z

totheparameters

W

ispurelybasedonseondorderstatistis

onstraints. This is only optimal if the input data is drawn from Gaussian distributions,

whihis arather strit restrition.

To transfer as muh information as possible to the parameters

W

the error term in

Fig. 4.1 must be omputed with a riterion transferring as muh information as possible

about the input data

X

, and the desired response

z

, to the parameters

W

of the mapper

Y = g (X, W )

. If we base our alulations on informationtheory and optimize with infor- mation theoreti riteria in Fig. 4.1 and Eq. (4.2) we have what Prinipe et al desribes

as information theoreti learning [18℄. The main advantage with information theoreti

riteriaare that they are funtions of probabilitydensities and apture alldata statistis,

not justthe seond-order statistis. This givesuslearningmahineswhere the parameters

W

desribes our data ina muh better way than the traditionalMSEan.

(32)

4.3 Information theoreti quantities

In this setion we disuss some important information theoreti quantities that may be

used as informationtheoreti riteriain anITL mahine.

Denitions A disrete stohasti variable

X

is assoiated with a triple

(x, A X , P X )

,

where the outome

x

is the value of the stohasti variable whih takes on a set of

possible values

A X = { a 1 , a 2 , . . . , a N }

. These have probabilities (distribution)

P X = { p 1 , p 2 , . . . , p N }

, where

P (x = a i ) = p i

,

p i ≥ 0

and

P

a i ∈A X P (x = a i ) = 1

.

A ontinuous stohasti variable

X

is assoiated with a probability density funtion

f X (x),

wheretheoutome

x

isthevalueofthestohastivariable. Thepdfisdenedasthe

derivative of the umulativedistribution funtion (df), dened as

P (X ≤ x 0 ) = F X (x 0 )

,

where

0 ≤ F X (x) ≤ 1

. Hene,

f X (x 0 ) = x F X (x) | x = x 0

, and

R ∞

−∞ f X (x)dx = 1

.

We have statistialindependene between random variables

X 1 , . . . , X N

if and only if

f (x 1 , . . . , x N ) = Q N

i=1 f(x i )

.

A metri ona set

X

is a funtion

u : X × X → R

. For all x,y,z in

X

, this funtion is

required tosatisfy the followingonditions

1.

u(x, y) ≥ 0

.

2.

u(x, y) = 0

if and only if x=y.

3.

u(x, y) = u(y, x)

.

4.

u(x, z) = u(x, y) + u(y, z)

.

4.3.1 Entropy

All information theoreti riteria are related to the onept of entropy. We now explain

Shannon's measure of entropy and some of it's properties. A more general version of

entropy, the Renyi entropy is then reviewed.

Shannon's entropy

Assume there is some unertainty in the outome of an randomexperiment and that the

possible outomes of the experiment is given by a probability distribution. This uner-

tainty was rst quantied by Shannonas

H = H N (p 1 , p 2 , . . . , p N )

satisfyingthe following

riteria[25℄ [9℄

1.

H N (p 1 , p 2 , . . . , p N )

is asymmetri funtionof itsvariables.

As anexample,

H N (p 1 , p 2 , . . . , p N ) = H N (p 2 , p 1 , . . . , p N ).

2.

H N (p 1 , p 2 , . . . , p N )

is aontinuous funtion of

p 1 , p 2 , ...., p N

.

3.

H N ( N 1 , . . . , N 1 )

attains the maximum value.

(33)

4.

H N +1 (tp 1 , (1 − t)p 1 , p 2 , . . . , p N ) = H N (p 1 , p 2 , . . . , p N ) + p 1 H 2 (t, 1 − t)

for any distri-

bution

p X

and

0 ≤ t ≤ 1

.

The fourth property of Shannon entropy may be explained as follows [25℄. If a hoie

is broken down into two suessive hoies, the original entropy (

H

) is the weighted sum

of the individual values of

H

[25℄. This isillustrated in Fig. 4.2.

Figure 4.2: At the left we have three possibilities, eah hosen aording

to the probabilities

p 1 = 1 2

,

p 2 = 1 3

,

p 3 = 1 6

. On the right, we rst hoose

between two possibilities eah with probability

1

2

. If the seond possibility is hosen, we make another hoie with probabilities

2 3

and

1

3

. The nal

results have the same probabilities as before. We require, in this speial

ase, that

H( 1 2 , 1 3 , 1 6 ) = H( 1 2 , 1 2 ) + 1 2 H( 2 3 , 1 3 ).

The

1 2

oeient is beause

this seond hoie only ours half the time.

Shannon showed that the only

H

satisfyingthe above assumptions is[25℄

H N (p 1 , p 2 , . . . , p N ) = H N ( P X ) = − K X

p i ∈P X

p i log b p i ,

(4.4)

withtheonventionthat

0 log b 0 = 0

. Thismeasurehealledentropy,beauseitisthesame

expression used to deneentropy in statistialmehanis.

K

is someonstant, depending

of the units of the sample data. If the base

b = 2

, the entropy unit isbits and if

b = e

the

unitisnats. Inthis thesisweleavethebase

b

ofthelogarithmunspeied,sineitisjusta

measurement sale. Entropy isusually denoted by

H(X)

where

X

is alabel fora random

variable, and not the argument of afuntion. Shannon's entropy depend onthe quantity

I(p i ) = − log p i ,

(4.5)

(34)

whihwas proposed by Hartley asa measure of the informationreeived by learningthat

a single event of probability

p i

took plae [7℄. Hene the Shannon entropy is a weighted

average of informations

I(p i )

H(X) = E { I(p i ) } .

(4.6)

Properties of Shannon entropy

SeveralpropertiesoftheShannonentropyanbederivedbasedonthefourbasiproperties

[9℄[25℄ [3℄

1. Addingorremovinganeventwithprobabilityzerodoesnotontributetotheentropy,

H N (p 1 , p 2 , . . . , p N , 0) = H N (p 1 , p 2 , . . . , p N ).

2. It vanishes when one outome is ertain,

H N (p 1 , p 2 , . . . , p N ) = 0

,

p i = 1

,

p j = 0, j 6 = i, i = 1, . . . , N

.

3. The maximum of

H N

inreases as

N

inreases.

4.

H N ≥ 0

.

Example ToillustratesomepropertiesofShannon entropyletthe stohastivariable

X

be given by

X =

1,

with probability

p 0,

with probability

1 − p

.

The entropy inthis ase is

H(X) = − p log(p) − (1 − p) log(1 − p),

asshown inFig.4.3asafuntionof

p

. NoteinFig.4.3thatthe entropy iszerowhen

p = 0

or

p = 1

,meaningthereisnounertainty aboutthe outome ofthe stohastiexperiment.

If

p = 1 2

the unertainty is maximized, and we need on average 1 bit to transmit the

outome of the experiment.

(35)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.2 0.4 0.6 0.8 1

p

H(p)

Figure 4.3:

H(X) = H(p)

in bits with Shannon's entropy, notie that

H(X) = 1

, when

p = 1 2

.

(36)

Shannon dierential entropy

Fora ontinuous stohasti variable

X

with density

f (x)

1 the dierentialentropy

h(X)

is

dened as [25℄ [3℄,

h(X) = − Z

f(x) log f (x)dx.

(4.7)

This an alsobewritten asan expeted value

h(X) = E f {− log f(x) } .

(4.8)

Properties of Shannon's dierential entropy

1. If

X

is limited to a ertain volume

v

in spae, then

h(X)

is maximum and equal to

log v

when

f (x)

is onstant (uniformdensity funtion)in the volume.

2. Dierential entropy may be negative. If we onsider the uniform density above, for

v < 1

,

log v < 0

.

3. The normal distributionmaximizes the entropy over alldistributions with the same

ovariane. This property an be exploited to measure the non-Gaussianity of a

stohasti variable.

4. The dierentialentropy is ameasure that is relativetothe oordinatesystem. Con-

sider for example hanging oordinates by a linear transformation,

Y = MX

. In

that ase,

h(Y) = h(X) + log | det(M) | .

Renyi's entropy

Asexplained above,Shannon'sentropy isameasure ofthe average amountofinformation

ontained ina single observation of a random variable. Renyi used a more generaltheory

of mean values, where the mean of the real numbers,

x 1 , . . . , x N

, with positive weighting

p 1 , . . . , p N

, has the form[18℄ [22℄

x = ϕ −1 X N

i=1

p i ϕ(x i ),

(4.9)

where

ϕ(x)

isaKolmogorov-Nagumofuntion,whihisanarbitraryontinuousandstritly monotonifuntiondened onthereal numbers. Hefound thatageneralentropy measure

H

obeys the relation[18℄

H = ϕ −1 X N

i=1

p i ϕ(I(p i ))

!

,

(4.10)

1

(37)

where

I(p i )

is Hartley's information measure. In order to be an information measure

ϕ( · )

annot be arbitrary, sine information is additive. We have two hoies,

ϕ(x) = x

or

ϕ(x) = 2 (1−α)x

. The rst ase gives Shannon's entropy and the seond gives Renyi's

entropy of order

α

[18℄

H R α (X) = 1 1 − α log

X N k=1

p α k

!

α > 0, α 6 = 1.

(4.11)

There is a well known relation between Shannon's and Renyi's entropy. Let

H S

denote

Shannon's entropy, then [18℄

H R α ≥ H s ≥ H R β

if

0 < α < 1

and

β > 1,

α→1 lim H R α = H S .

Renyi's and Shannon's entropies an also be related to eah other in another way. If we

onsiderthe probabilitymass funtion

P = (p 1 , p 2 , . . . , p N )

asapointin theN-dimensional spae, this point will always be in the rst quadrant of a N-dimensional hyperplane with

eahaxis interseting the oordinateone. The distane of the point

P

tothe originis the

α

root of

d α = X N k=1

p α k = || P || α

and the

α

root of

d α

is alled the

α

-norm of the probability mass funtion [18℄. Renyi's entropy satises all of Shannon's riteria in 4.3.1 on page 23. Exept of the fourth prop-

erty. The Renyi's entropy of order

α = 2

, is denoted by Renyi's quadrati entropy and

orresponds to the 2-normof the probabilitymass funtion.

If werepeat the exampleon page24with Renyi'squadrati entropy measure, we get a

similar shape inFig. 4.4, as inShannon's entropy in Fig. 4.3onpage 25.

(38)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.2 0.4 0.6 0.8 1

p

H R 2 (p)

Figure 4.4:

H(X) = H(p)

in bits with Renyi's quadrati entropy measure, notie that

H(X) = 1

when

p = 1 2

.

(39)

Renyi's dierential entropy

For the ontinuous random variable

X

with pdf

f (x)

we obtain the dierential version of

Renyi's entropy [18℄

h R α (X) = 1 1 − α log

Z

f α (x)dx,

(4.12)

= 1

1 − α log E f { f 1−α (x) } .

(4.13)

Ifweset

α = 2

, we get the dierentialRenyi quadrati entropy

h R 2 (X) = − log

Z

f 2 (x)dx,

(4.14)

= − log E f { f(x) } .

(4.15)

Some properties of the Renyi entropy of order

α

are the following[9℄

1. Just asfor Shannonentropy, theRenyi entropyis maximizedfora uniformdistribu-

tion for randomvariables with nite support.

2. The Renyi entropy is not in general maximized by the Gaussian distribution in the

xed varianease.

3. The Renyi entropy is invariantto rotationsand translations.

4.3.2 Divergene

This setionreviews some of the mostommonmeasures ofdivergene orrelative entropy

used in information theoreti learning. Divergene is used as a measure of statistial

similarity, and one an think of it as a generalization of algebrai distane measures to

probability spaes [5℄.

Kullbak-Leibler divergene

This measure disriminates two probability density distributions

p(x)

and

q(x)

, and it is

alsoreferred to asrelative entropy

D KL { p, q } = Z

p(x) log p(x) q(x) dx,

= E p

log p(x) q(x)

.

(4.16)

Some properties of this measure are [9℄

Referanser

RELATERTE DOKUMENTER

In addition we have also estimated the mean vector and covariance matrix using “full” median, the standard/classical maximum likelihood estimate, and as well as two robust

In order to perform reasoning the behaviour models shall have access to data about the simulated environment and react to events in the simulated environment, where the

This essay considers different viewpoints on the challenges of fusing and coordinating Media Operations / Public Affairs (PA) and Information Operations (IO), and discusses the

The first two domains (communications and core enterprise services) are generic technical enablers for the user-facing applications (land applications, modeling and

In order to obtain more information on the loss of organic material a n d bound nutrients from the euphotic zone in Linddspollene, w e measured sedimentation rates of total

When the focus ceases to be comprehensive health care to the whole population living within an area and becomes instead risk allocation to individuals, members, enrollees or

Using a Bayesian learning model, I estimate and compare the relative effects of prior beliefs and new information on party and leader evaluations, and the effect of partisan bias

In this section we give an overview of different refinement oracles used in the hierarchical radiosity setting 1 , and also some information theory tools for scene discretisation