Master's Thesis in Eletrial Engineering
Information Theoreti Learning for
Pattern Classiation
by
Ola Kvisle Storås
Deember, 2007
FACULTY OF SCIENCE
Department of Physis and Tehnology
University of Tromsø
Thisthesisisastudyofpatternlassiationbasedoninformationtheoretiriteria. Infor-
mationtheoreti riteriaareimportantmeasures basedonentropy anddivergene between
data distributions. First, the basi onepts of pattern lassiation with the well known
Bayes lassiation rule as a startingpointis disussed. We disuss how the Parzen win-
dow estimator may be used to nd good density estimates. The Parzen window density
estimator an be used to estimate ost funtions based on information theoreti riteria.
Furthermore,we explain amodelof aninformationtheoreti learningmahine. Withost
funtionsbased oninformationtheoreti riteria,weargue thatalearningmahinepoten-
tiallyapturesmuhmore informationaboutadata setthan thetraditionalmeansquared
error ost (MSE) funtion. We nd that there is a geometri link between information
theoreti ost funtions estimated using Parzen windowing, and mean vetors in a Mer-
er kernel feature spae. This link is used to propose and implement dierent lassiers
based on the integrated squared error (ISE) divergene measure, operating impliitly in
a Merer kernel feature spae. We also apply spetral methods to implement the same
ISE lassiers working inapproximations of Merer kernel feature spaes. We investigate
the performane of the lassiers when we weight eah data pointwith the the inverse of
the probability density funtion at that point. We nd that the ISE lassiers working
impliitly in the Merer kernel feature spae performs similar to a Parzen window based
Bayes lassier. Usinga weighted inner-produt denition gives slightly better results for
some data sets, while for other data sets the lassiation rates are slightly worse. When
omparingtheresults between theimpliitISElassierusing unweighted datapointsand
theParzenwindowBayeslassier,someoftheresultsindiatethattheISElassierfavor
the lasses with highest entropy.
I would like to thank my supervisor in this thesis, Robert Jenssen, for introduing and
explainingmanynewfasinatingideasinpatternanalysisduringtheworkwiththisthesis.
1 Introdution 1
1.0.1 Quik summaryof ontent inthis thesis. . . 3
1.1 Denitionsand notation used . . . 4
1.2 Struture and literature . . . 4
I Theory 7 2 Pattern lassiation basis 9 2.1 The Bayes lassier . . . 9
2.2 Minimum probabilityof lassiation error for the Bayes lassier . . . 10
3 Density estimation 13 3.1 Parametri methods . . . 13
3.2 Non-parametrimethods . . . 14
3.2.1 The histogramdensity estimator . . . 14
3.2.2 The naive density estimator . . . 14
3.2.3 The Parzen windowdensity estimator. . . 15
3.2.4 The multivariateParzen window . . . 16
4 Information theoreti learning priniples 19 4.1 Informationtheory . . . 19
4.2 Aninformation theoreti learningmahine . . . 20
4.3 Informationtheoreti quantities . . . 22
4.3.1 Entropy . . . 22
4.3.2 Divergene . . . 29
4.4 Estimationof informationtheoreti ost funtions . . . 32
4.4.1 Renyi quadratientropy estimate . . . 33
4.4.2 ISE divergene estimate . . . 34
4.4.3 Cauhy-Shwarz divergene estimate . . . 34
4.4.4 Usingnon-Gaussian kernels in estimation of ost funtions . . . 35
4.4.5 Shannon entropy estimate . . . 35
5.2 Informationmeasures in the Merer kernel spae . . . 38
5.2.1 ISE divergene . . . 39
5.2.2 Cauhy-Shwarz divergene . . . 40
5.3 The standard ISE lassier . . . 42
5.3.1 Connetionsto Parzen windowBayes lassier . . . 43
5.3.2 Multilassstandard ISE lassier . . . 44
5.3.3 UsingNon-Merer kernels . . . 45
5.3.4 Connetionto the graph ut . . . 45
6 A Laplaian ISE lassier 47 6.1 Modied ISE divergene . . . 48
6.2 Connetionto the Bayes probability of error . . . 48
6.3 Kernelspae and Laplaian matrix representation . . . 49
6.3.1 Illustrationof weights . . . 52
7 Spetral ISE lassiers 55 7.1 Mappingof data to aMerer based feature spae . . . 56
7.1.1 Illustrationof mappings to Merer spae . . . 57
7.2 Spetral versions of the ISE lassiers . . . 59
7.2.1 The spetral ISE lassier . . . 59
7.2.2 The spetral LaplaianISE lassier . . . 59
II Analysis and experiments 61 8 Kernel seletion 63 8.1 Eet of kernel bandwidth and kernel type . . . 63
9 Classiation experiments 69 9.1 Introdution . . . 69
9.2 Seletion of data sets and lassiation methods . . . 69
9.3 Standard ISE . . . 70
9.3.1 Confusionmatries Standard ISE and Bayes . . . 72
9.3.2 Summaryof lassiation results for the standard ISE lassier . . 76
9.4 LaplaianISE . . . 77
9.4.1 Summaryof lassiation results for the LaplaianISE lassier . . 84
9.5 Spetral ISE . . . 84
9.5.1 Summaryof lassiation results for the spetral ISE lassier . . . 89
9.6 Spetral LaplaianISE . . . 90
9.6.1 SummaryoflassiationresultsforthespetralLaplaianISElas-
10 Conlusion 97
10.1 Furtherwork . . . 97
Appendix 99
A Laplaian ISE lassier 99
A.1 Averageonfusion matriesLaplaian ISE lassier . . . 99
B Spetral ISE lassier 101
B.1 Averageonfusion matriesSpetral ISE . . . 102
C Spetral Laplaian ISE lassier 105
C.1 Averageonfusion matriesfor the spetralLaplaian ISElassier. . . 106
Bibliography 109
Introdution
Patterns are used to desribe any relations, regularities or struture inherent in a data
set generated by a soure. Similar patterns are often grouped into a lass. By deteting
signiant patterns in the available data, a learning mahine an make preditions about
new data oming fromthe same soure [26℄. In patternlassiation we use some labeled
trainingdataset, whereeahlabelrepresent alass,and usealearningmahinetopredit
the orret lass label for a new unlabeled sample. A learning mahine may be viewed
as a devie that adjust a set of parameters through a learning proess. For eah new set
of training data given to the mahine, the parameters are updated using a riteria that
aptures thewanted informationtodesribethedata inanewform. Patternlassiation
isone ofthe fundamentalproblemsinmahine learningand signalproessing[29℄, [26℄,[4℄.
Itisanimportantpartinomputervision,medialimaging,optialharater reognition,
geostatistis, handwriting reognition, biometriidentiation, natural language proess-
ing,doumentlassiation,emailspamdetetionandreditsoring,tolistafewexamples
[29℄, [4℄, [23℄,[1℄, [15℄, [8℄.
Information theoreti learning (ITL) methods emerged in [18℄ and [5℄. Information
theoreti learning here refers to the use of a general learning mahine as we desribe
in Setion 4.2, where a riteria related to Renyi's quadrati entropy is used to update
the learning proess. The riteria often use Renyi's quadrati entropy to nd divergene
measures between data in dierent distributions. A divergene measure an be thought
of as a generalization of algebrai distane measures (suh as the Eulidean norm) to
probability distributionspaes [5℄. In [18℄ two important informationtheoreti divergene
measures were presented, one is based on the Cauhy-Shwarz (CS) inequality and the
other onthe integrated squared error (ISE)between two probability distributions.
The Renyi's quadrati entropy is estimated using a Parzen window density estimation
method, and may be thought of as a generalization of variane to proesses with non-
Gaussiandistributions[5℄. In[18℄and [5℄informationtheoreti oneptsare explainedand
used in time series predition, independent omponent analysis (ICA), feature extration
and blind soure deonvolution.
Independent of ITL, a number of kernel methods have emerged in the reent years.
Kernel methods generally solve mahine learning problems in two parts: A module, also
known as a kernel funtion, performs a mapping of data to a new feature spae. In this
newfeaturespaealearningalgorithmisusedtodisoverlinearpatterns [26℄. Theuse ofa
kernelfuntionallows ustooperateimpliitlyinapossiblehighdimensionalspaethrough
evaluations of inner-produts. This is also known as the kernel trik. The advantage
of operating in high dimensional spaes is that the probability that the data is linearly
separable inreaseswiththe numberofdimensions weoperatein[2℄. Kernelmethodshave
been used suessfully in various elds of mahine learning, and inlude algorithms suh
assupportvetormahines(SVMs), kernelFisherdisriminantanalysis (KFD)and kernel
prinipalomponent analysis (KPCA) [17℄.
The anity matrixof adata set is alulatedwith the pairwiseinner-produts ofdata
samples with a kernel funtion, Element
(i, j)
of this matrix ontains the inner-produt between data samplei
andj
, omputedwith akernel funtion. The inner-produtsof the anity matrix may be weighted suh that eah data point is multipliedwith the inverseoverallprobabilitydensityfuntionatthatpoint. Thisanitymatrixwithweightedinner-
produts is referred to as the Laplaian matrix. Spetral methods based on the spetral
properties, i.e. the eigenvetors and eigenvalues of the anity matrix or the Laplaian
matrix, havebeen popular inreent lustering appliations [5℄,[9℄.
It has been shown [13℄ and [9℄, that when using the non-parametri Parzen window
density estimatorwitha Mererkernel funtiontoestimatethe Renyi'squadrati entropy
for adata set, the result may be interpreted in termsof a mean vetorin a Merer kernel
feature spae. By measuring distanes between lass mean vetors in a Merer kernel
feature spae we an reate dierent information theoreti divergene measures between
lass distributions. TheCS divergene measure isshownin[13℄toberelatedtomeasuring
angles between lass mean vetors in a Merer kernel feature spae and have interesting
onnetionstograph theory. The ISEdivergene measure isina similarmannershown to
be relatedto the Eulidean distane between mean vetors ina Merer kernel spae [9℄.
The link between ITL and Merer kernel methods has been used to develop reent
lassier [12℄, [9℄, and lustering [9℄, [11℄, [10℄ algorithms based on the CS divergene
measure. The lassier proposed in [12℄ works impliitlyina Merer kernel feature spae,
whiletheCSbasedspetrallusteringalgorithmsworkinapproximateMererkernelspaes
spanned by the prinipal eigenvetors of the Laplaian matrix. These methods have all
usedweightedinner-produtkernelfuntionswhenalulatingtheCSdivergenemeasure.
This thesis is inspired by the use of the CS divergene measure in both lassiation
and lustering appliations. We provide neessary bakground on information theoreti
learning onepts to understand newly disovered, important relations between Merer
kernel theory, information theoreti measures and density estimation. In partiular we
fousonthepropertiesoftheISEinformationtheoretidivergenemeasureanduseMerer
kernel properties and geometri properties of this measure to argue that it may be used
as a ost funtion in an information theoreti lassier. Previous information theoreti
lassiers havetoour knowledgeonlybeenimplemented usingthe weighted CS divergene
measure. Thus we aim to use the ISE divergene measure in a similar manner. We
investigate performane and properties of lassiers based on the ISE divergene using
through evaluations with Merer kernels.
WealsoinvestigatespetralversionsoftheISElassiers,whereweeigendeomposethe
Laplaiandata matrix or the anity matrix. None of the ISE divergene based lassiers
presented inthis thesis havebeen presented before, sowe ouldnot know howthey would
perform. Wehoose toompare theresultsobtainedwithourimplementationoftheBayes
lassier, both beause it is a well known lassier and beause we nd out that the ISE
based lassierrule may beexpressed in avery similarway tothe Bayes lassierrule.
We present two dierent versions of the ISE lassier operating impliitly in dierent
Mererspaes. ThestandardISEandtheLaplaianISElassier,operatingonunweighted
and weighted trainingdata, respetively. We alsodevelop spetral versions of these las-
siers, the spetral ISE lassier and the spetral Laplaian ISE lassier, working in
the approximated Merer spaes spanned by the prinipal eigenvetors of the anity and
Laplaianmatries, respetively. The ISE based lassiers working impliitly ina Merer
kernelspaeingeneralseemstogivebestresults. ForsomedatasetstheLaplaianindued
weights improve the lassiation rates, but in others it redues or does not hange the
lassiation rates signiantly.
1.0.1 Quik summary of ontent in this thesis.
•
Weprovideneessarybakgroundinformationabouttherelativelynewoneptsused ininformationtheoreti learning. We alsoreview bakground informationneessarytounderstand basi density estimation and patternlassiation.
•
We introdueand investigatenew lassiers based on the informationtheoreti ISE divergene measure and kernel methods, using both weighted and unweighted data.•
We investigate relations between an ISE divergene based lassier operating im- pliitly in a Merer kernel spae and the well known Parzen window based Bayeslassier. Wendthat usingunweighted data theISE lassierisomparabletothe
Bayes lassierwith slightly dierent properties.
•
We use the spetral properties of the anity and Laplaianmatries of the data, toproposeandinvestigateISEbasedlassiersworkingdiretlyinapproximatedMerer
kernel spaes. We note that in most ases the spetral versions of the ISE lassier
perform slightly worse than the versions working impliitly inMerer spaes.
1.1 Denitions and notation used
x,y A training patternand lass label
i,N Counter and number of patterns
i.i.d Independent identiallydistributed
pdf Probability density funtion
ISE Integrated Squared error
MISE Mean Integrated Squared Error
AMISE Asymptoti Mean Integrated Squared Error
IP Information Potential
ITL Information Theoreti Learning
W σ 2 ( · , · )
A Gaussian kernel, with bandwidthσ 2 K h ( · , · )
A general Merer kernel, with bandwidthh F
A Merer kernel feature spaed
Dimensionality of dataC
Numberof lassesTable 1.1: Denitions and abbreviations
Density funtions are usually referened with small letters e.g.
p(x)
. Probabilities are referened with large letters e.g.P (x)
. Integrals with no limits are assumed to be withlowerlimit
−∞
andupperlimit∞
. Thesamplex
maytakeany valueinthed
dimensional spae, unless otherwise speied. Expetations with regard to a variable or funtionf
isdenoted by
E f {·}
if it may beunlear what we alulatethe expeted value with respetto.
1.2 Struture and literature
Struture of this thesis This thesisis divided inthree parts. In Part I wepresent the
theory neessary to understand and implement our lassiers. Part II ontains analysis
and experiments done to hek how the theory works onsome populardata sets. In Part
III we onlude the thesis and suggests further work that may be done.
•
Chapter 2introdue thebasi onepts ofpatternlassiation. TheBayeslassier is used as an example and shown to give a minimum probability of lassiationerror.
•
Chapter 3disusses variousmethodsfor density estimationwith anemphasis ontheParzen windowdensity estimatorand itsproperties.
•
Chapter 4 explains the onept of an information theoreti learning mahine. A detailed disussion of various information theoreti riteria is given. Some sample•
Chapter 5 explain the kernel trik and how it an be used to express information theoreti measures inaMererkernel spaefromasimple geometriviewpoint. Thestandard ISE lassieris developed using this geometri view of the ISE divergene
between two distributions. This lassier is shown in a speial ase to be equal to
the familiarBayeslassierusing density estimates. Somerelationsbetween theISE
divergene measure and graph theory are disussed.
•
Chapter 6 presents the Laplaian ISE lassier, a modied version of the standardISE lassierusing weighted data samples.
•
Chapter 7 presents spetral versions of the standard ISE and Laplaian lassier,working in approximated Merer kernel spaes.
•
Chapter 8 provides a short analysis of the eet of dierent kernel sizes and kerneltypesused in the previously presented lassiers.
•
Chapter 9presents and disusses resultsfound using thedierentlassiers onsomepopular datasets. Wetry toillustratethe eets of weighting the data samplesand
howdata distributes inthe approximate Merer kernel spaes.
•
Chapter 10 onludes this thesis and suggests some work that may be done in thefuture.
•
The appendix ontains some additionallassiationresults not listedin Chapter 9.Literature Information theory in general is overed by [3℄ and the artile whih rst
used entropy as a measure in ommuniation, [25℄. The priniples behind information
theoretilearningwasmainlyintroduedin[18℄ withannieoverview in[5℄and[9℄. Good
introdutionbookstopatternlassiationmethodsare[29℄,[4℄and[8℄. TheCSdivergene
based lassier whih inspiredmuh of the work with our ISE divergene based lassiers
is presented in [12℄. Important relationsbetween Merer kernel funtions,Parzen window
density estimators, CS divergene and graphtheory isreviewed in[13℄. Asurvey of kernel
methods used in pattern analysis is given in [26℄ and [17℄. Spetral methodsused in this
thesis are inuened by materialin[26℄ [9℄, [24℄, [6℄and [31℄.
Theory
Pattern lassiation basis
This hapter gives a denition of pattern lassiation. It also reviews the well known
Bayes lassiation rule, whih is shown to give a minimum probability of lassiation
error. TheBayeslassiermaythus beused asabenhmark whentesting otherlassiers
withregardtoerrorrates. InPartII,Chapter 9,weuseanimplementationofthislassier
toompare withthe new lassiers whih willbepresented inlater haptersof thisthesis.
Denition The task of lassiation is to nd a rule, whih based on observations of
trainingpatterns assigns anunlassied pattern toone of several possible lasses. A las-
siationrule with two dierent lasses isto estimatea funtion
g : R d −→ {− 1, 1 } ,
usinginput-outputtrainingdatapairsgeneratedi.i.daordingtoanunknown probability
distribution
p(x, y ), (x 1 , y 1 ), . . . , (x N , y N ) ∈ R
d× Y, Y {− 1, 1 }
suhthat
g
willorretlylassifya new samplex
[17℄. A samplex
isassignedtothe lasslabeled
+1
ifg(x) ≥ 0
. Thesamplex
isassumed tobegeneratedfromthesamepdfp(x, y)
as the trainingdata.
2.1 The Bayes lassier
To explainmore indetail howwean dene alassierwe begin withthe Bayeslassier.
The reason for starting with this is that it an be shown to give an optimal result with
regard to minimum probability of lassiation error, under ertain onditions. This will
be proved in the next setion. It isalso well known and has some theoretial onnetions
tothe ISE lassier, whih willbeexplained inSetion 5.3.1.
Wewanttolassifyanunknownfeaturevetor
x
tooneofC
possiblelassesω 1 , . . . , ω C
in a way that assigns
x
to the lass where it'smost likely to belong. We dene what ismost likely with the probabilities
P (ω i | x), i = 1, . . . , C
, also known as the a posterioriprobabilities. A possible lassiation rule isto assign
x
to the lassω ∗ satisfying[29℄
ω ∗ = max
ω i
P (ω i | x), i = 1, . . . , C.
(2.1)Assume that the a priori probabilities
P (ω i ) i = 1, . . . , C
, are known. If they areunknown they anbeestimated fromour trainingdata as
P (ω i ) = n N i whereN
isthe total
number of training samples, and
n i is the number of samples belonging to lass ω i. The
lass-onditional probability density funtions
p(x | ω i )
, also known as likelihood funtionsof
ω i, with respet to x
, are alsoassumed tobe known [29℄. Ifthese are unknown we will
see later how they an be estimated.
1
To alulate the a posteriori probabilities we an
use Bayesrule [29℄
P (ω i | x) = p(x | ω i )P (ω i )
p(x) , i = 1, . . . , C,
(2.2)where
p(x) = X
i
p(x | ω i )P (ω i ).
Sine
p(x)
isa ommonfator forall lasses itmay be ignored and wean useω ∗ = max
ω i
p(x | ω i )P (ω i ), i = 1, . . . , C,
(2.3)as our lassier. If the apriori probabilities
P (ω i )
, are equal Eq.(2.3) redues toω ∗ = max
ω i
p(x | ω i ), i = 1, . . . , C,
andthesearhforthemostprobablelassforfeature
x
reduestoevaluatingtheonditionalpdfs at
x
. Sine the lassier inEq. (2.3) is obtained using Bayes rule, itis oftenreferredtoas the Bayeslassier. Inthe two lass ase we use Eq. (2.3) to assign
x
to lassω 1 if
p(x | ω 1 )P (ω 1 ) > p(x | ω 2 )P (ω 2 ).
Our lassiation funtionan nowbe dened asto lassify
x
tolassω 1 if
g(x) = p(x | ω 1 )P (ω 1 ) − p(x | ω 2 )P (ω 2 ) ≥ 0
2.2 Minimum probability of lassiation error for the
Bayes lassier
InFig.2.1,theBayesianlassierfortwoequiprobablelassesforaone-dimensionalfeature
vetor
x
is illustrated. The region to the left of the dotted threshold linelearly ontains most ofp(x, ω 1 ) = p(x | ω 1 )P (ω 1 )
and we dene this asR 1, and the region to the right
1
If
x
isdisretethelikelihood funtionsbeomeprobabilitiesandaredenoted withP ( x| ω i )
Figure 2.1: Illustration of two deision regions. Figure borrowed from [29℄
of the threshold as
R 2. Let R 1 and R 2 be the regions where we lassify x
to ω 1 and ω 2,
R 2 be the regions where we lassify x
to ω 1 and ω 2,
ω 2,
respetively. The total area overed by
p(x | ω 1 )
inregionR 2 and p(x | ω 2 )
in R 1 willbethe
probabilityofausinglassiationerrors. Ifthethresholdismovedtotheleftorright,this
areaand probabilitywillinrease. Thismeans thatif wewant tominimizethe probability
of an error, the deision regions
R 1 and R 2 must be seleted by moving the threshold so
this area is assmallas possible.
In a multilass situationwith a multidimensionalfeature vetor
x
we haveC
dierentlasses with deision regions
(R 1 , . . . R C )
our feature vetorx
an be plaed in. We nowgeneralize the situation in Fig. 2.1. Writing the probability of a orret deision by the
joint probability
P (x ∈ R i , ω i ),
then the probability of erroneously assigning
x
toω j by not seletingthe orret lass ω i
is
P (x ∈ R j , ω i ), ∀ j 6 = i.
The total probability for ommittingan error inlassiation isthus
P e = [
∀j6=i
P (x ∈ R j , ω i ),
= X
∀j6=i
P (x ∈ R j | ω i )P (ω i ),
= X
∀j6=i
P (ω i ) Z
R j
p(x | ω i )dx,
= X
∀j6=i
Z
R j
P (ω i | x)p(x)dx.
The total probability for orretlassiation is
X
i
Z
R i
P (ω i | x)p(x)dx = 1 − P e .
Hene
P e = 1 − X
∀i
Z
R i
P (ω i | x)p(x)dx.
The error is learly minimizedwhen the regions
R i are seleted in away where
R i : P (ω i | x) > P (ω j | x), ∀ j 6 = i,
whihis the same as Eq. (2.1).
Density estimation
Theostfuntionsusedbythelassiersdisussedinthisthesisarealldependentonnding
somesortofdensityestimateforthelassdistributionsofthedata. Theasewhereweknow
the distributions of the feature vetors in eah lass
ω i, given by the likelihood funtions
p(x | ω i )
, is unfortunately not the reality for most data sets. We have to nd estimates
of these distributions. There are basially two ategories of methods for estimation of
pdfs, parametri and non-parametri methods. In this setion, for ompleteness, a short
desription of parametri methods for density estimation is given. For more details [29℄
is reommended. The non-parametri methods are far more important in information
theoreti lassiation, sine we often annot assume that the data set has a parametri
distributionshape. Inpartiular theParzenwindowmethodfor density estimationwillbe
investigated. Throughoutthishapterweassumeadatasetof
N
samples,x i , i = 1, . . . , N
,generated i.i.dfrom unknown distributions,unless otherwise is speied.
3.1 Parametri methods
Assume that the pdf to be estimated is desribed in parametri form by some unknown
parameter vetor
θ
, so it an be written asf (x; θ)
. We have a limited numberN
ofi.i.dtrainingdata,
x 1 , . . . , x N availablefromour distribution. Usingthese sampleswean
use dierent methods to nd an estimate of the parameters in θ
suh that the estimate
f ˆ (x; θ)
is as lose as possible to the true pdf. This means that we assume that our data
is generated from a distribution with a shape that is lose to a parametri form, e.g we
assume a Gaussian, Rayleigh or some other well known distributionand try to adjust its
parameters to t our data as good as possible. The parameters in
θ
are usually foundby maximum likelihood estimation. In [29℄ some methods for parametri estimation are
desribed, e.g. maximum likelihoodestimation, mixture models, maximum entropy et.
3.2 Non-parametri methods
To avoid the need to make assumptions about a parametri shape of the desired distri-
bution, we must often use non-parametri methods. In this setion, we review dierent
methodsofdensityestimationwithaone-dimensionalrandomvariable
x
takenfromaon-tinuous,univariatedensity funtion
f(x)
. Westartwiththe simplehistogrammethodandexpandituntilweendup withthe Parzenwindowestimator. Someof themostimportant
properties of the Parzen window estimator are then disussed. In the last setion, the
Parzen windowmethodis expanded to the ase where we have multivariate distributions,
where the variable
x
is amultidimensionalvetor.3.2.1 The histogram density estimator
The oldest way to nd a non-parametri estimate of a funtion is given by the histogram
[30℄, [9℄. Given an origin
x 0, and a bin width h
, the bins of the histogram are dened as
[x 0 + mh, x 0 + (m + 1)h)
for positive and negative integers m
. The histogramestimateof
the funtion
f (x)
is thenf ˆ (x) = 1
Nh (
noofx i insame bin asx).
(3.1)
Thisestimatorisobviouslydisontinuousand notusablewhen weneedtondderivatives.
3.2.2 The naive density estimator
This is alsoa variantof the histogrammethod. Dene the pdf evaluatedat
x
asf (x) = lim
h→0
1
2h P (x − h < X < x + h)
(3.2)The probability
P (x − h < X < x + h)
an be estimated by ounting the numberof datasamplesfallingintoabinof size
2h
entered atx
. This anbedened morepreiselywitha weight funtion
W (x) = 1
2
if| x | ≤ 1 0
otherwise,
suh that the naive estimator an beexpressed as [9℄
f ˆ (x) = 1 Nh
X N i=1
W
x − x i
h
.
(3.3)Introduinga resalingnotation
W h (u) = h −1 W (u/h)
werewrite Eq. (3.3) asf(x) = ˆ 1
N X N
i=1
W h (x − x i ).
(3.4)From Eq.(3.3) wesee that anestimateforthe pdf at
x
is given by plaingabox aroundeah sample
x i with width 2h
and height (2Nh) −1 and sum up. This estimator is not
3.2.3 The Parzen window density estimator
Parzen generalized the weight funtion
W h ( · )
in Eq. (3.4) toa kernel funtion or Parzenwindow whihis a funtionsatisfying [29℄
K h (x) ≥ 0
andZ
x
K h (x)dx = 1.
Thesubsript
h
refers tothebandwidth orwindowwidth of thekernel[30℄. UsuallyK( · )
ishosen tobe aunimodalprobability density funtion that issymmetriaroundzero. This
makes sure that the estimator
f(x) = ˆ 1 N
X N i=1
K h (x − x i )
(3.5)produes anestimatewhih isalsoa density.
Toinvestigatesomeofitsproperties,wendexpressionsforthemeanvalueandvariane
of Eq. (3.5). Let
f(x) ˆ
be the estimate of the true densityf (x)
atx
, withx ′ a random
variable with density
f (x)
. ThenE { f ˆ (x) } =E { K h (x − x ′ ) }
= Z
K h (x − x ′ )f(x ′ )dx ′
=K h (x) ⋆ f (x).
(3.6)The density estimate is therefore a smoothed version of the true density. The bias of the
estimatoris given by [30℄
E { f ˆ (x) } − f (x) = [K h (x) ⋆ f (x)] − f (x),
(3.7)and the variane is [30℄
V ar { f ˆ (x) } =E { [ ˆ f(x) − E { f ˆ (x) } ] 2 }
= 1
N { (K h 2 (x) ⋆ f (x)) − [(K h (x) ⋆ f (x)] 2 } .
(3.8)It is ommonto measure the loseness of the estimator
f ˆ (x)
tothe target densityf(x)
inthe point
x
by the size of the mean squared error (MSE)MSE { f(x) ˆ } = E n
[ ˆ f (x) − f(x)] 2 o ,
whihan bewritten as
MSE { f ˆ (x) } = 1
N { (K h (x) 2 ⋆ f (x)) − [(K h (x) ⋆ f (x))] 2 } + { (K h (x) ⋆ f (x)) − f(x) } 2
=V ar { f(x) ˆ } + [Bias { f ˆ (x) } ] 2 .
Instead of just estimating the funtion
f (x)
at a single point we want to estimate it overthe whole
x
-spae. The mean integrated error (MISE) is a more appropriate measure for analyzingf ˆ (x)
whereMISE { f(x) ˆ } = Z
MSE { f ˆ (x) } dx
= Z
E { [ ˆ f(x) − f (x)] 2 } dx + Z
V ar { f ˆ (x) } dx.
(3.9)The biasand varianeterm inEq.(3.9)depend onthe kernel width
h
indierentways. Ithas been shown that Eq. (3.9) for large sample sizes
N
, the asymptoti mean integratedsquared error (AMISE) isgiven by [14℄,[30℄
AMISE { f(x) ˆ } = (Nh) −1 R(K ) + 1
4 h 4 µ 2 (K) 2 R f ′′
(3.10)
where
µ 2 (K) = R
z 2 K(z)dz
,R(f ′′ ) = R
{ f ′′ (x) } 2 dx
,f ′′ (x) = d d 2 2 x f (x)
andR(K) = R K(z) 2 dz
. Weseethatminimizingtheleftterm(thevariane)withalargekernel windowh
results in a huge inrease in the bias part whih is proportional toh 4. This is what is
known as the variane-bias trade-o in kernel size seletion. There exists many ways to
nd the kernel size
h
. We mention two popular methods here. The rst dierentiates Eq. (3.10) and equates itto zero, obtainingh AM ISE =
R(K) µ 2 2 (K)R(f ′′ )N
1 5
The other method estimates
R(f ′′ )
by assuming that the true underlying density is anormal density. Then the kernel size is given by [9℄
h AM ISE = 1.06N − 1 5
Several other methodsexist, see [14℄ and [30℄.
3.2.4 The multivariate Parzen window
TheextensionoftheParzenwindowtofeaturedataina
d
-dimensionalspaeisalittlemore diult. The sparseness of data in higher dimensional spaes makes the estimation morediult, unless we have very many samples. This phenomenon is usually referred to as
the urse of dimensionality. Rememberingthat the kernel funtioninthe one-dimensional
ase speify the window width,this window willinthe multidimensionalase be replaed
with hyperubesand eah dimensionof the ube requiresa parameter tobe estimated for
the kernel. A diret extension of the univariate kernel estimate in Eq. (3.5), is obtained
by replaing the point
x
with a vetor-pointx ∈ R d and the variable x i with a d
-variate
d
-variatesample
x i with density f(x)
. The Parzen estimatorbeomes [30℄
f(x) = ˆ 1 N
X N i=1
K H (x − x i )
(3.11)where
H
is a symmetripositivedenited × d
matrix alled the bandwidthmatrixK H (x) = | H | −1/2 K(H −1/2 x).
With further restritions on
H
, see [30℄ for details, we get the single bandwidth kernelestimator
f(x) = ˆ 1 Nh −d
X N i=1
K { (x − x i )/h } .
(3.12)There exists several methodstogivean estimateofthe optimalkernelsize for amulti-
variatedataset. The optimalkernelsizeisusuallyseleted tominimizetheMISE between
f ˆ (x)
and the target densityf (x)
. The normal referene rule for the MISE kernel size isgiven by Silverman'srule [28℄, [9℄
ˆ h = σ x
4 (2d + 1)N
d+4 1
,
(3.13)where
σ x 2 = d −1 P
i Σ ii
andΣ ii are the diagonalelementsof the sample ovarianematrix.
Due to the urse of dimensionality this method is not regarded as reliable for higher di-
mensionaldata. In this thesis most of the data sets are of higher dimensions, so we have
hosen a ross validationtehnique to nd the best kernel sizes inthe density estimates.
Some well-known Parzen windows with
u = x−x h i are listed below, where for u ≥ 1
all
windows evaluate to zero, exept the Gaussian kernel. Forsimpliity we only present the
one-dimensional versions,but they an easilybeextended to multivariateversions.
•
UniformK h (u) = 1/2.
•
EpanehnikovK h (u) = 3
4 (1 − u 2 ).
•
GaussianW σ 2 (x) = 1
√ 2πσ 2 exp (
− (x − x i ) 2 2σ 2
) .
•
QuartiK h (u) = 15
16 (1 − u 2 ) 2 .
•
TriweightK h (u) = 35
32 (1 − u 2 ) 3 .
•
CosinusK h (u) = π
4 cos π 4 u
.
Information theoreti learning priniples
This hapter starts with a brief introdution of information theory. An overview of the
onepts inaninformationtheoreti learningmahineisthen given. Informationtheoreti
riteria,whihgivesusthetoolstomeasuretheshapeof,anddistanebetween,probability
distributions,is then explainedindetail. Inthe lastsetion wedisuss some ost funtion
estimators used in ITL.
4.1 Information theory
Information theory is in this thesis related to C.E Shannon's report from 1948, A math-
ematial theory of ommuniation [25℄. Shannon dened a measure of information or
unertainty assoiated with a stohasti experiment and named it entropy. This measure
was used toanswer importantquestionsin ommuniation. Shannon used entropy tond
alimittohowmuhinformationanbetransferred overanoisyhannel,and tond ways
todesign optimalodes for data ompression.
Entropy an bethoughtof asthe unertaintyassoiated withthe value ofa realization
ofasingle randomvariable. Itisameasure onhowmuhinformationthatisgainedabout
the ontent of astohasti random variable aftera stohasti experiment.
Shannon also dened a measure alled mutual information, whih is the amount of
information that one random variable arries about another, i.e. the redution in the
unertaintyofone randomvariableduetotheknowledgeofthe other. Mutualinformation
is a speial ase of a more general quantity, alled relative entropy. The relative entropy
or divergene an be used as a measure of distane between two distributions. It is a
measure of the ineieny of assuming that a distribution is given by a density funtion
q( · )
,whenitinfathas adistributiongiven byadensityfuntionp( · )
. Thisisalsorefereedtoas adivergene measure between distributions. In this thesis we use dierent estimates
of entropy and divergene asinformationtheoreti measures.
Figure 4.1: A general learning mahine.
4.2 An information theoreti learning mahine
A very general learningmahine may be desribed using a modellikethe one in Fig. 4.1.
Mahine learning is in general divided into supervised, semi-supervised and unsupervised
tasks. The model in Fig. 4.1 an be used to desribe all of the above mahine learning
tasks. Wehave some input data
X
,ontaining informationor measurements from areal- world event. We want the learning mahine to perform some spei task onX
. This isdone by givingthe inputdata
X
to apossibly non-linear parametri mappingfuntiong : R d → R M ,
(4.1)whihtransforms the input vetor
X ∈ R d to Y ∈ R M
Y = g (X, W ),
(4.2)where
W
aretheparametersofthemappingfuntion. Iftheoptimalityriterionisbasedonaninformationtheoretimeasure, eitherentropyordivergene, weallthis aninformation
theoreti learningmahine. The mapperfuntionin Eq. (4.2)transform theinput data to
a new formdepending onthe task of the learningmahine. The outputof the mapper,
Y
isompared with anoptimality riterionand optionallyadesired response
z
for theinputX
. For eah presentation of training data the optimality riterion is evaluated and the error terme
isfed to anadaptation algorithmwhih update the parametersW
.Supervisedlearningonerns alearningmahinewith adesiredresponse foreahinput
X
. This thesis fous on supervised learning, speially on lassiation of data. If thelearningtask is lassiation,the desiredresponse
z
for the training data ontains a lasslabel. Thetaskisthentolassifyarandominputsampletooneof
C
lasses. Themappingfuntion may inthis ase bedesribed as
g : R d → { y 1 , . . . , y c } ,
where
y 1 , . . . , y c arethepossiblelasslabels, andM = 1
inEq.(4.1). Regressionisanother
example of atypialsupervised learningtask, where the mappingis
g : R d → R .
Semi-supervised learningtasksonerns learningwhere thelabelsofthe inputdataare
only partially known. One example is ranking where we only have available the relative
orderingofthetheexamplesinthetrainingset,whileouraimistoenableasimilarordering
of noveldata.
Unsupervised tasksare learningtaskswhere thewanted informationfromthe datahas
tobeextrated withoutany desiredresponse inthe optimalityriterion. Clustering isone
typialexampleofunsupervisedlearning,wheretheaimistondanaturaldivisionofdata
into homogeneous groups [26℄. Anomaly and novelty detetion are other examples, where
the taskistodetetsamplesthatdeviate fromthenormal. Other importantunsupervised
tasks are nding low-dimensional representations of the input data, important examples
ofthis isprinipal omponentanalysis (PCA)and independentomponent analysis (ICA).
In PCA, the mapping inEq. (4.1) aims to projet the input
X
toa lowerM
-dimensional spae, whereM
denotes the number of unorrelated features inX
. In ICA the goal is toprojet
X
to a lowerM
-dimensional spae where eah of the features ofX
are mutuallyindependent.
Traditionallythe riteria foroptimalityinFig. 4.1has been tominimizethe MSEost
funtion between the output
Y
of the mapperand the desiredoutputz J(Y ) = E
(z − Y ) 2 .
(4.3)Inourgeneralmahinelearningmodelwewanttotransferasmuhinformationaspossible
aboutourdataintothemapperfuntion
g(X, W )
,suhthatthismapperisabletodesribeour data as aurately as possible. The optimality riterion is thus ritial in obtaining
the parameters
W
. If weuse MSE, the informationtransferred fromthe measurementsX
andthe desiredresponses
z
totheparametersW
ispurelybasedonseondorderstatistisonstraints. This is only optimal if the input data is drawn from Gaussian distributions,
whihis arather strit restrition.
To transfer as muh information as possible to the parameters
W
the error term inFig. 4.1 must be omputed with a riterion transferring as muh information as possible
about the input data
X
, and the desired responsez
, to the parametersW
of the mapperY = g (X, W )
. If we base our alulations on informationtheory and optimize with infor- mation theoreti riteria in Fig. 4.1 and Eq. (4.2) we have what Prinipe et al desribesas information theoreti learning [18℄. The main advantage with information theoreti
riteriaare that they are funtions of probabilitydensities and apture alldata statistis,
not justthe seond-order statistis. This givesuslearningmahineswhere the parameters
W
desribes our data ina muh better way than the traditionalMSEan.4.3 Information theoreti quantities
In this setion we disuss some important information theoreti quantities that may be
used as informationtheoreti riteriain anITL mahine.
Denitions A disrete stohasti variable
X
is assoiated with a triple(x, A X , P X )
,where the outome
x
is the value of the stohasti variable whih takes on a set ofpossible values
A X = { a 1 , a 2 , . . . , a N }
. These have probabilities (distribution)P X = { p 1 , p 2 , . . . , p N }
, whereP (x = a i ) = p i, p i ≥ 0
and P
a i ∈A X P (x = a i ) = 1.
A ontinuous stohasti variable
X
is assoiated with a probability density funtionf X (x),
wheretheoutomex
isthevalueofthestohastivariable. Thepdfisdenedasthederivative of the umulativedistribution funtion (df), dened as
P (X ≤ x 0 ) = F X (x 0 )
,where
0 ≤ F X (x) ≤ 1
. Hene,f X (x 0 ) = ∂ ∂ x F X (x) | x = x 0, and
R ∞
−∞ f X (x)dx = 1.
We have statistialindependene between random variables
X 1 , . . . , X N if and only if
f (x 1 , . . . , x N ) = Q N
i=1 f(x i ).
A metri ona set
X
is a funtionu : X × X → R
. For all x,y,z inX
, this funtion isrequired tosatisfy the followingonditions
1.
u(x, y) ≥ 0
.2.
u(x, y) = 0
if and only if x=y.3.
u(x, y) = u(y, x)
.4.
u(x, z) = u(x, y) + u(y, z)
.4.3.1 Entropy
All information theoreti riteria are related to the onept of entropy. We now explain
Shannon's measure of entropy and some of it's properties. A more general version of
entropy, the Renyi entropy is then reviewed.
Shannon's entropy
Assume there is some unertainty in the outome of an randomexperiment and that the
possible outomes of the experiment is given by a probability distribution. This uner-
tainty was rst quantied by Shannonas
H = H N (p 1 , p 2 , . . . , p N )
satisfyingthe followingriteria[25℄ [9℄
1.
H N (p 1 , p 2 , . . . , p N )
is asymmetri funtionof itsvariables.As anexample,
H N (p 1 , p 2 , . . . , p N ) = H N (p 2 , p 1 , . . . , p N ).
2.
H N (p 1 , p 2 , . . . , p N )
is aontinuous funtion ofp 1 , p 2 , ...., p N.
3.
H N ( N 1 , . . . , N 1 )
attains the maximum value.4.
H N +1 (tp 1 , (1 − t)p 1 , p 2 , . . . , p N ) = H N (p 1 , p 2 , . . . , p N ) + p 1 H 2 (t, 1 − t)
for any distri-bution
p X and 0 ≤ t ≤ 1
.
The fourth property of Shannon entropy may be explained as follows [25℄. If a hoie
is broken down into two suessive hoies, the original entropy (
H
) is the weighted sumof the individual values of
H
[25℄. This isillustrated in Fig. 4.2.Figure 4.2: At the left we have three possibilities, eah hosen aording
to the probabilities
p 1 = 1 2, p 2 = 1 3, p 3 = 1 6. On the right, we rst hoose
p 3 = 1 6. On the right, we rst hoose
between two possibilities eah with probability
1
2
. If the seond possibility is hosen, we make another hoie with probabilities2 3
and1
3
. The nalresults have the same probabilities as before. We require, in this speial
ase, that
H( 1 2 , 1 3 , 1 6 ) = H( 1 2 , 1 2 ) + 1 2 H( 2 3 , 1 3 ).
The1 2
oeient is beausethis seond hoie only ours half the time.
Shannon showed that the only
H
satisfyingthe above assumptions is[25℄H N (p 1 , p 2 , . . . , p N ) = H N ( P X ) = − K X
p i ∈P X
p i log b p i ,
(4.4)withtheonventionthat
0 log b 0 = 0
. Thismeasurehealledentropy,beauseitisthesameexpression used to deneentropy in statistialmehanis.
K
is someonstant, dependingof the units of the sample data. If the base
b = 2
, the entropy unit isbits and ifb = e
theunitisnats. Inthis thesisweleavethebase
b
ofthelogarithmunspeied,sineitisjustameasurement sale. Entropy isusually denoted by
H(X)
whereX
is alabel fora randomvariable, and not the argument of afuntion. Shannon's entropy depend onthe quantity
I(p i ) = − log p i ,
(4.5)whihwas proposed by Hartley asa measure of the informationreeived by learningthat
a single event of probability
p i took plae [7℄. Hene the Shannon entropy is a weighted
average of informations
I(p i )
H(X) = E { I(p i ) } .
(4.6)Properties of Shannon entropy
SeveralpropertiesoftheShannonentropyanbederivedbasedonthefourbasiproperties
[9℄[25℄ [3℄
1. Addingorremovinganeventwithprobabilityzerodoesnotontributetotheentropy,
H N (p 1 , p 2 , . . . , p N , 0) = H N (p 1 , p 2 , . . . , p N ).
2. It vanishes when one outome is ertain,
H N (p 1 , p 2 , . . . , p N ) = 0
,p i = 1
,p j = 0, j 6 = i, i = 1, . . . , N
.3. The maximum of
H N inreases asN
inreases.
4.
H N ≥ 0
.Example ToillustratesomepropertiesofShannon entropyletthe stohastivariable
X
be given by
X =
1,
with probabilityp 0,
with probability1 − p
.The entropy inthis ase is
H(X) = − p log(p) − (1 − p) log(1 − p),
asshown inFig.4.3asafuntionof
p
. NoteinFig.4.3thatthe entropy iszerowhenp = 0
or
p = 1
,meaningthereisnounertainty aboutthe outome ofthe stohastiexperiment.If
p = 1 2 the unertainty is maximized, and we need on average 1 bit to transmit the
outome of the experiment.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0
0.2 0.4 0.6 0.8 1
p
H(p)
Figure 4.3:
H(X) = H(p)
in bits with Shannon's entropy, notie thatH(X) = 1
, whenp = 1 2.
Shannon dierential entropy
Fora ontinuous stohasti variable
X
with densityf (x)
1 the dierentialentropyh(X)
isdened as [25℄ [3℄,
h(X) = − Z
f(x) log f (x)dx.
(4.7)This an alsobewritten asan expeted value
h(X) = E f {− log f(x) } .
(4.8)Properties of Shannon's dierential entropy
1. If
X
is limited to a ertain volumev
in spae, thenh(X)
is maximum and equal tolog v
whenf (x)
is onstant (uniformdensity funtion)in the volume.2. Dierential entropy may be negative. If we onsider the uniform density above, for
v < 1
,log v < 0
.3. The normal distributionmaximizes the entropy over alldistributions with the same
ovariane. This property an be exploited to measure the non-Gaussianity of a
stohasti variable.
4. The dierentialentropy is ameasure that is relativetothe oordinatesystem. Con-
sider for example hanging oordinates by a linear transformation,
Y = MX
. Inthat ase,
h(Y) = h(X) + log | det(M) | .
Renyi's entropy
Asexplained above,Shannon'sentropy isameasure ofthe average amountofinformation
ontained ina single observation of a random variable. Renyi used a more generaltheory
of mean values, where the mean of the real numbers,
x 1 , . . . , x N, with positive weighting
p 1 , . . . , p N, has the form[18℄ [22℄
x = ϕ −1 X N
i=1
p i ϕ(x i ),
(4.9)where
ϕ(x)
isaKolmogorov-Nagumofuntion,whihisanarbitraryontinuousandstritly monotonifuntiondened onthereal numbers. Hefound thatageneralentropy measureH
obeys the relation[18℄H = ϕ −1 X N
i=1
p i ϕ(I(p i ))
!
,
(4.10)1
where
I(p i )
is Hartley's information measure. In order to be an information measureϕ( · )
annot be arbitrary, sine information is additive. We have two hoies,ϕ(x) = x
or
ϕ(x) = 2 (1−α)x. The rst ase gives Shannon's entropy and the seond gives Renyi's
entropy of order
α
[18℄H R α (X) = 1 1 − α log
X N k=1
p α k
!
α > 0, α 6 = 1.
(4.11)There is a well known relation between Shannon's and Renyi's entropy. Let
H S denote
Shannon's entropy, then [18℄
H R α ≥ H s ≥ H R β if 0 < α < 1
and β > 1,
α→1 lim H R α = H S .
Renyi's and Shannon's entropies an also be related to eah other in another way. If we
onsiderthe probabilitymass funtion
P = (p 1 , p 2 , . . . , p N )
asapointin theN-dimensional spae, this point will always be in the rst quadrant of a N-dimensional hyperplane witheahaxis interseting the oordinateone. The distane of the point
P
tothe originis theα
root ofd α = X N k=1
p α k = || P || α
and the
α
root ofd α is alled the α
-norm of the probability mass funtion [18℄. Renyi's
entropy satises all of Shannon's riteria in 4.3.1 on page 23. Exept of the fourth prop-
erty. The Renyi's entropy of order
α = 2
, is denoted by Renyi's quadrati entropy andorresponds to the 2-normof the probabilitymass funtion.
If werepeat the exampleon page24with Renyi'squadrati entropy measure, we get a
similar shape inFig. 4.4, as inShannon's entropy in Fig. 4.3onpage 25.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0
0.2 0.4 0.6 0.8 1
p
H R 2 (p)
Figure 4.4:
H(X) = H(p)
in bits with Renyi's quadrati entropy measure, notie thatH(X) = 1
whenp = 1 2.
Renyi's dierential entropy
For the ontinuous random variable
X
with pdff (x)
we obtain the dierential version ofRenyi's entropy [18℄
h R α (X) = 1 1 − α log
Z
f α (x)dx,
(4.12)= 1
1 − α log E f { f 1−α (x) } .
(4.13)Ifweset
α = 2
, we get the dierentialRenyi quadrati entropyh R 2 (X) = − log
Z
f 2 (x)dx,
(4.14)= − log E f { f(x) } .
(4.15)Some properties of the Renyi entropy of order
α
are the following[9℄1. Just asfor Shannonentropy, theRenyi entropyis maximizedfora uniformdistribu-
tion for randomvariables with nite support.
2. The Renyi entropy is not in general maximized by the Gaussian distribution in the
xed varianease.
3. The Renyi entropy is invariantto rotationsand translations.
4.3.2 Divergene
This setionreviews some of the mostommonmeasures ofdivergene orrelative entropy
used in information theoreti learning. Divergene is used as a measure of statistial
similarity, and one an think of it as a generalization of algebrai distane measures to
probability spaes [5℄.
Kullbak-Leibler divergene
This measure disriminates two probability density distributions
p(x)
andq(x)
, and it isalsoreferred to asrelative entropy
D KL { p, q } = Z
p(x) log p(x) q(x) dx,
= E p
log p(x) q(x)
.
(4.16)Some properties of this measure are [9℄