UNIVERSITETET I OSLO Institutt for informatikk
Implementing an automatic face recognition system
Hovedoppgave
Nikolai Czajkowski
1. august 2006
The goal of this projet was toimplement and evaluate anautomati fae reogni-
tion system. Fae loalization was adapted from the algorithm of Rowley et.al. A
20 × 20
pixel sliding window sans a pyramid of the input image, feeding intosev-eral neural networks. The router network estimates rotation, while three detetor
networks lassify the window as a fae or non-fae. This approah is very ompu-
tationally intensive, and estimating rotation is the most demanding aspet. Our
ontributiontothe loalizationalgorithmis toreduethis proessingtime by i)us-
ingagenetialgorithmto evolve a moresparsely onneted routernetwork, and ii)
tolimitthesearhspaebythresholdingtheinputimageonskinolourandtexture.
An eigenfae based algorithmwas used for fae reognition. Our ontributionwas
to have the system automatially extrat multiple and dierently segmented faes
fromthe trainingimage, rather than manually segmenting a single fae.
Evaluationoftheloalizationalgorithmdemonstratedthatitisgenerallyrobust,
abletoorretlyandauratelysegmentfaesdespitevariationsinlighting,rotation,
sale,faialexpression andolusion. Furthermore,althoughthedierenebetween
networktopologieswassmall,theevolvedroutertopologyimprovedspeedbyroughly
13%, while improving the loalization rate ompared to Rowleys fully onneted
topologyby3%. Texture thresholdingredued therunningtimeofthe algorithmby
almosttwothirdsinbotholourandgraysaleimages,whileskinolourthresholding
independently redued the running time by 45%. In olour images the ombined
eet was aredution in proessingtime of 75%.
Benhmarking of the reognition algorithm demonstrated that eigenfae-based
reognitionisverysensitivetovariationsinfaesegmentation relativetothestored
template. However, beause automati segmentation is likely to vary somewhat
omparedto amanualsegmentation of the same fae,performane of the eigenfae
based approah an be improved by automatially extrating and storing several
slightlydierently segmented templates per person.
TheprogramwasimplementedinC,andnewlibrariesforneuralnetworks,image
proessing,TIFF reading/writing and geneti algorithmswere written tomake the
system fast and exible.
I would like to thank my supervisor Anne Solberg for a patiene that has now
spanned a onsiderablenumberof semesters, and for muh helpful advie.
Iwould alsoliketothank myfriends, family,and girlfriend. I'mlookingforward
toseeing youallagain.
Oslo, July 2006
Nikolai Czajkowski
Aknowledgments . . . 3
1 Introdution 6 1.1 Biometris . . . 7
2 Introdution to the onstituent parts of the algorithm 11 2.1 Fae pereption, brain and biology; what an be learned, and what should be used? . . . 11
2.2 Colour and olour spaes . . . 14
2.3 Neural networks . . . 15
2.3.1 Neurons - The basi buildingbloks . . . 16
2.3.2 Network topology . . . 17
2.3.3 Network training . . . 19
2.4 Geneti algorithms . . . 20
2.4.1 Elementsand operatorsin genetialgorithms . . . 21
2.4.2 Running ageneti algorithm . . . 23
2.4.3 Geneti algorithmsfor neuralnetworks . . . 23
3 Fae loalization 25 3.1 An overview of the algorithms . . . 25
3.1.1 Feature-based/knowledge-based algorithms . . . 25
3.1.2 Image based algorithms . . . 25
3.2 Illustratinghallenges with a model-based algorithm. . . 26
3.3 The importane of aurate loalizationfor reognition . . . 30
3.4 Rowley's neuralnetwork-based approah . . . 31
3.5 Our fae loalizationapproah . . . 33
3.5.1 Preproessing: Colour transformation and onstruting the image pyramid . . . 33
3.5.2 Optimizing the searh by using olour and texture . . . 33
3.5.3 Trainingthe networks . . . 36
3.5.4 Usinganevolutionaryalgorithmtoimprovethenetworktopol- ogy . . . 38
3.5.5 Reduing falsepositives . . . 41
3.5.6 Putting it alltogether: Loalizing afae . . . 42
3.5.7 Other possible optimizations . . . 44
4.1 An overview of the algorithms . . . 46
4.1.1 Feature based methods . . . 46
4.1.2 Image based methods . . . 47
4.1.3 Reognition by eigenfaes . . . 47
5 Implementation details 49 5.1 TIFF reading/writing library . . . 49
5.2 Image proessing library . . . 50
5.3 Neural network library . . . 50
5.4 Geneti algorithmlibrary. . . 52
6 Benhmarking 53 6.1 Fae loalization . . . 53
6.1.1 Benhmarking set . . . 53
6.1.2 Soring hits and misses . . . 54
6.1.3 Trainingdata, algorithmparameters and test system . . . 55
6.1.4 Benhmarking results . . . 56
6.1.5 Disussion and omparison to othersystems . . . 60
6.2 Fae reognition . . . 61
6.2.1 Benhmarking set . . . 61
6.2.2 System setup . . . 62
6.2.3 Benhmarking results . . . 62
6.3 Comparisons toother systems . . . 64
7 Conlusions and personal omments 65 7.1 On amore personal note . . . 66
7.2 Conlusion . . . 68
8 Appendix: HS lustering algorithm 68
9 Appendix: Sample images from model-based loalization 70
10 Appendix: Sample images from fae loalization benhmarking 70
In 2001 the ity of Tampa, Florida hosted the annual Amerian Football league
nal,the SuperBowl. That year, in additiontothe regularmediafrenzy,the nals
reeived anextradose ofmediaattention asityoialshadinstalledanautomati
faereognitionsystem,FaeIt,inthestadiumentrane. Camerasreordedpeople
entering, FaeIt loalized their faes in the video stream and ompared them to
images ina polie database. The initiative slingshot biometrisystems into publi
awareness, and earnedtheitythe Worst publioial awardfromPrivay Inter-
national. Similarsystems have sinebeen installedinahandful ofloations,but so
farthey havenot ledtoany arrests. Thelak of arrestshighlightsthe sobering fat
thatdespitethereentupsurgeinpubliimaginationandavailabilityof ommerial
faereognitionproduts,thistehnologyhas yettomature,and robustreognition
remainsa hallenge.
Perhapsheremorethaninanyotherareaofmahinevision,therelativeeasewith
whihpeopleperformthetaskomparedtoomputersbeomesapparent. A frontal
faein your eld of vision potentially indiatesthat anindividual has its attention
diretedatyou, andthis ouldbeanimportantevent. Also,formoresoialanimals
likeus,itisvitaltobeabletoreliablydistinguishbetweenthemembersofourgroup.
Asvisionisourprimarysense,itisnaturalthatthisdisriminationisbasedonvisual
pereption. Faes are suiently omplex to ontain the neessary information
to distinguish individuals, and even ontain informationabout the person's mood.
As fae reognition is so important for us, many have speulated that speialized
neural strutures have evolved to failitate this task. The degree to whih people
are endowed with suh speialized mental modules is stillheatedly debated, and
we will briey return to it later. What is ertain however, is that regardless of
howspei the neuraliruitsare, already shortlyafterbirth hildrenare oriented
towards fae-likevisual stimuli[27℄. Byfae-like werefer to patterns haraterized
by symmetry and a ertain degree of omplexity. So, from the very beginning and
throughout life, individuals are exposed to a massive amount of faes. Combined
withpossiblehard wired neural iruits,this experieneresults infaes being the
lass of stimuli where we have the nest level of disriminability. As with many
tasks in pattern reognition, what seems almost eortless to humans has proved
very diult to implement robustly in a omputer, even though automati fae
detetion and reognition has been an ative area of researh for almost 30 years
[42℄.
in the algorithm, suh as olour spae, neural networks and geneti algorithms. I
assume the reader is familiarwith these subjets already, but a short introdution
willlater make it easier to disuss their purpose in the implemented system. Any
system that attempts to reognize a fae, or try to respond to the presene of one
insome way,must rstnd itinthe image. Inpratie thisinvolvessegmenting the
faefromthebakgroundbyndingasexatlyaspossibletheboundingoordinates.
While loalization and reognition may be intertwined in human fae reognition,
itpresents a natural divide in automated algorithms, and I have hosen to present
these steps of the algorithm separately in hapters 3 and 4. Chapter 5 gives a
summary of how the ideas outlined in the previous hapters are implemented into
a working system, as well as presenting some major design deisions. In hapter 6
results from benhmarking of the system are presented. In hapter 7 I reet on
some of the more general lessons learned from implementing the system the way
that we did.
1.1 Biometris
Biometris(anientGreek: bios ="life",metron ="measure")is the
study of automated methods for uniquely reognizing humans based
uponone ormore intrinsiphysial orbehavioraltraits [36℄.
The denition given above ontains two riteria that must be fullled by any
biometrisystem. It must be automated in the aquisition of data, the extration
of relevant features, and the lassiation. Seondly, the features where by this
lassiationisperformedarephysialor behavioral. Thesefeaturesan beaquired
fromoneormorephysialdomains,suhasngerprints,iris,voie,fae,bodyshape,
weight et. Howthe system subsequently proesses the data depends onwhat kind
of data it has olleted, as well as the purpose of the system. In terms of purpose,
it is ommon to dierentiate between identiation and identity veriation. An
identiation appliation suh as the FaeIt system introdued above, attempts
to determine the identity of a given individual. On the other hand, an identity
veriationsystemattemptstoverifythattheindividualreallyiswhohe/shelaims
tobe.
Before proeeding, let us refresh some fundamental onepts in lassiation
theory. In fae loalization, the task of the lassier is to determine whether a
feature vetor represents a fae. The lassier an ategorize the input orretly,
makesanerrorit aneither beby indiatingafaeispresent whenit infatisnot,
false positive, orlaimingthat it isnot present when itis, false negative.
Often there isa trade-obetween the degreeof ertainty of abiometrimethod
andhowinvasiveitisinaquiringtheneessaryinformation. DNAanalysis,whihis
beominginreasinglyautomated,andan underertainirumstanes beregarded
asabiometrisystem,willidentifyapersonwithvirtuallynoleveloferror. However
the proedure requires a physial sample of the individual. So, biometri systems
dier in many respets, how do we ompare alternative approahes? Biometri
systems an be ompared onthe following riteria[14℄:
•
Universality: towhat degree everyone posses the neessary features.•
Uniqueness: whether the physial features uniquely distinguish between dif- ferent individuals.•
Permanene: whether the physial features used are invariant over time, oran be alteredby the individual.
•
Aeptability: towhat degree people willaeptbeing subjeted to analysis.•
Colletability: howeasy itis to olletthe neessary data.•
Performane: how well the best algorithms perform in terms of speed, au-rayand robustness.
•
Cirumvention: howeasy it isto foolthe system.As an example, take ngerprints. They have fairly high universality, although
sarring an interfere with lassiation, and an estimated 5% of the population
does not have elligable prints [8℄. Fingerprints are to suh an extent unique that
even idential twins dier, as do the prints of the same individual aross ngers.
Furthermore,ngerprintvarylittleovertime,andurrentbiometrisystemsanuse
themtolassifyindividualswithverylowlevelsoferror. However, ngerprintshave
some aeptaneissues, as they are usually assoiated with riminalinvestigations.
Also, in olleting the data you need the individuals ooperation, or at the very
least, awareness and attention. On the other hand, fae reognition an be arried
out from a distane, even without the individual being aware. So universality,
aeptability and olletabilityof faereognition systems arehigh. However, faes
Furthermore, they vary over time as people wear glasses, grow beards or put on
makeup. The performane of state of the art algorithmsistherefore stillfairly low.
Simplied, any biometri system onsists of two major omponents. First fea-
turesmustbeextrated,thenthey must beomparedwithtemplates inadatabase.
One data of a ertain physial domain has been sampled, and before the lassi-
ation an be arried out, there is usually some preproessing required. When
a ngerprint sanner gathers data, some translation or rotation may be neessary,
but otherwise the data is fairly well segmented. This is not the ase for most fae
reognition appliations. Their input is usually a two dimensional image, and the
rst task of the system will invariably be to nd as aurately as possible the o-
ordinates of any faes present. Findingfaes in images may seem trivial, after all,
we all know what they look like. However, I will throughout this thesis attempt
todemonstrate that loalizationis in fat a hallenging task, at least as the visual
sene the individual is in beomes more omplex and the lighting onditions less
ideal.
Some of the fators ompliating the task of fae loalizationwill be disussed
and illustrated insetion 3.2. Fornow it issue tosay that the shapeof the fae
an ause two-dimensional representations to vary onsiderably when light is ast
fromdierent angles. Furthermore, olusion by objets suh as igarettes, glasses
andin turntheir shadows makesdetetion stillmore diult. Also,faesan be at
any distane or orientation relativeto the amera.
If loalizationissuh a hallenge,is itworth the eort? Why would we need to
loalize people in images? The most obviousand perhaps most proled use of this
tehnology is surveillane and seurity. Given the esalating onerns with airport
seurity and border ontrols, the tehnology has at least in these setors enormous
eonomipotential. Butthereare otherand perhapsmorebenignusesfor thesame
tehnology that reeive less attention. With the steep growth in sales of digital
ameras, the integration of inreasingly high quality digital ameras in embedded
eletronis suh as mobile phones and PDA's, digitization of video arhives, pro-
liferation of broadband aess, vasts amounts of video and images are beoming
available. Yet if this rise in availability is to result in a genuine inrease in aes-
sibility, then eah image must be labeled and tagged with ontent informationfor
retrievaland searhing. Unless suhtextual informationregarding ontentisstored
alongwith this data,itwillinreasinglybe rendered inaessible,lostinajungle of
information. The sheer amount of images to searh manuallybeomes prohibitive.
In CBI, a database is indexed not as onventionally done through a tag, verbal
desriptor or elementary data type, but by the atual ontents of the multimedia
item. In other words, rather than having a tag desribing the ontents along with
the data, ex. ar, the information is referened through features extrated from
thedataitself. Thereasonwhyfaeloalizationandreognitionalgorithmsarepar-
tiularlyrelevant to CBI, is that automati faereognition is one of few attempts
to lassify objets that an vary substantially over time and situations, but where
the dierenes between the individuallass members an be very small. I.e. a per-
son an hange faial expression, grow a beard or wear glasses, but the dierene
between individual people is relatively subtle. Fae reognition is therefore par-
tiularly diult, ompared to automated reognition of rigid objets suh as ars
or airplanes. Also, beause the relevant features needed to dierentiate individual
faes an hange under dierent onditions, most of the proposed algorithms are
abletolearnfromaset of trainingexamples. They arethereforesuiently generi
tobe appliedto loalizationorreognition of most other lasses of objets, given a
dierent trainingset.
Human mahine interation is the nal use for fae loalization/reognitional-
gorithmsIwillmention,anditisonewhihislikelytoinrease inimportaneinthe
near future. Faes are not just any lass of stimuli. We expet feedbak to be di-
retedtoourfaes, and our expressions alsoonveya lotof informationbeyond our
identity. Therefore, ashuman-omputer interation beomes more widespread, the
need for robust ways to detet, reognize and interpret the nonverbal information
present infaes willbeome more important.
rithm
2.1 Fae pereption, brain and biology; what an be learned,
and what should be used?
Engineers are ofteninspiredby the elegane andfuntionalityof biologialsolution
todiultproblems. However, astheyareworkingwithverydierentrawmaterials,
onvastlydierenttime salesandlevelsofomplexity,itisnotgiventhatbiologial
solutions an or should be applied to any given engineering problem. However, as
willbeevidentthroughoutthisthesis,several aspetsofthepresented algorithmare
inspiredby biologial systems and proesses. Also, as fae reognition is a skill all
healthy hildren develop seemingly withouteort, itmight beuseful tosketh how
this is aomplished by the human brain. This setion therefore gives a very brief
introdution to the funtional neurobiology of pereption, before desribing some
relevant and importantndingsin neuropsyhology.
Wean denereeptive eld of aneuronastheregionofthe visualeldinwhih
stimuliwill aet its level of ativity. As one moves up the neural hierarhy, from
the sensory ells to assoiation areas of the neoortex 1
, the reeptive elds of the
ells beome larger, and the stimuli that elit their response beomes inreasingly
omplexand abstrat [5℄.
The neural proesses of reognition begin in the retina, where light foused
throughthelenshitsphoto-pigmentmoleulesinthespeializedellsliningthewall,
ausingthese moleules tobreak down. The onstituent parts of the breakdown in
turnause ionhannels 2
tohangeshape,allowingan inuxof Na
+
,ultimatelyre-
sultinginaneletrialsignal. However, the retinadoesmorethansimplytransform
lightto aneletrohemialanalogue. A large amountof low level signal proessing
is performed in the retina. Information is arried from the eyes through the opti
nerve, whih is formed of speial ells alled ganglion ells. Due to proessing in
the retina,ativity of ganglionells does not merely represent the presene of light
at a partiular point in our eld of vision. Rather, these ells respond maximally
todierenes inintensitybetween the enter and the surroundingin theirreeptive
eld. This way retinal ganglion ells onvey information about disontinuities in
1
phylogenetiallynewerregionsofthebrainthat areneithermotororsensorybutarethought
tobeinvolvedinhigherproessingofinformation.
2
omplexmoleulesin theellmembrane
Through the opti nerve, informationis relayed through parts of the mid-brain
to the primary visual ortex. Still, in the region that the opti nerve feeds into
the ortex, the ring of the neurons represent very basi and fairly understandable
aspets of the visual stimuli. However, as one moves away into the surrounding
assoiative ortex, more abstrat and integrative qualities of the visual stimulus is
proessed(as istrue for all otherprimary sensory ortial areas).
However, one of the broad distintionsusually made invisualproessing isthat
of the ventral and dorsal stream 3
. The ventral stream passes information to the
temporallobe,andismostly onernedwith shape(objet)reognition. Thedorsal
stream follows a path up toward the parietal lobe, and is mostly responsible for
loalizingobjetsinspae. Aswefollowthedorsalstreamfurthertowardsthemotor
areasofthebrain,theneuraltissuegraduallyproessesourreation toobjets, suh
asreahing forthem. Thesetwo aspets,loalizationand reognition,antherefore
be dissoiated if one of the areas is damaged. A person may reognize an objet,
but isunable toquikly and aurately reah for it,orvia versa.
The introdution given to neural proessing of visual information is of ourse
onlyanabstration. Theoptinerveismadeup ofroughlytentimesasmanybers
astheohlearnerve 4
,andalargepartoftheortexandmanysubortialareasare
insomewayinvolvedintheproessingofvisualstimuli. Thestrutureof theortex
is far more homogeneous than the phylogenetially older neural areas. Therefore,
purely strutural and physiologial knowledge gives limited understanding of how
and what informationisproessed.
Mostknowledgeofnormalbrainfuntioning inhumansomesnotfromknowing
itsstruture, but fromstudying damaged brains. Broadlyspeaking the three main
souresofsuhdamageisshrapnel fromwarwounds, strokesandtumors. However,
omaredtowhatweknown aboutdamagetomotorareasintheparietallobe,muh
less is known about damageto the highervisual areas. The blood vessels that feed
theparietalmotorareas aremuhmoreonvoluted andvulnerableforrupture than
those of the oipital lobe. Shrapnel damage to the visual areas (oipital lobe)is
alsolikelytodamagebrainstemareasontrollingbreathingandbloodpressure,and
resultin fatality.
Damage to the eyes will of ourse result in blindness, and the same is true of
3
Ventralanddorsalaremedialtermsindiatingtowardsthefront andtowardsthetopand
bak,respetively
4
auditory nerve
higherintheproessinghierarhy anresultinlossofhigherlevelfuntions,suhas
olour blindness or inability to pereive movement. Agnosia is a term referring to
thelossof abilityto interpretthe stimuliinagiven sensory modality. Forexample,
apatient withvisual agnosia an experiene allthe onstituent elements ofperep-
tion,suh asolour, texture,edges, movement et., but isunable tointegratethem
all into a meaningful whole. One rare but highly studied version of this disorder
is that of prosopagnosia, thought to be a seletive impairment in reognizing faes
resultingfromdamagetoertainareas oftheventralstream. Apatientisoftenable
to judge orretly the expression of faes and to disriminate between instanes of
other lasses of objetssuh vegetables and fruit. Prosopagnosia is among the evi-
deneoftenitedasindiatingdediatedneuraliruitsevolvedforfaereognition.
However, just how seletive the impairmentis is a matter of some ontroversy [7℄,
as most patients have redued performane on many tasks requiring subtle visual
within-ategory disrimination. Fae reognition is perhaps the most omplex vi-
sualtaskpeopleroutinelyperform, andsuddenlyfailingtoreognizetheirspouse or
hildrenisertainlythemostnotieableanddebilitatingresultofthebraindamage.
Fora further disussion see [2℄ or[13℄.
While muh is known about the low-level visual system, neurophysiology has
not signiantly guided fae reognition algorithms beyond providing a biologial
justiation for using gabor lters and feed-forward neural networks 5
[26℄ . But
althoughtraditionalneuralnetworksareinspiredbygenuineneurons,they represent
onlyaverysimpliedmodel. However, more biologiallygrounded approahes that
mimi the biology of low-level vision have been published [26℄. So, what of the
higher order properties? One important question that ould impat the design of
ouralgorithmistowhatdegreefaepereption reliesonindividualfaialfeaturesor
anholistievaluation. Dofeaturesaseyes,noseandmouthindependently ontribute
toreognition,orisfaereognitionperformedonthefaialareaasawhole,without
breakingitdownintofeatures. Thisquestionisimportant,asitmirrorstwodierent
strategies in biometrisystems. We willlater desribe the eigenfae approah that
we have implemented, and this approah does not extrat features from faes, but
simplyompares two image areas.
Lastlyitshouldbementionedthatthebrainusesallinformationavailable,rang-
ingfromothersensorymodalitiestotheontextingeneral. Faesareloalizedfaster
if they are in a familiar ontext, and reognition is better if we an hear the per-
5
Duringtherstfewmilliseonds,visualproessingisthoughttobefeed-forward.
multimodalbiometrisystems 6
has superior performane ompared toonventional
systems, and isgaining popularity, [16℄.
2.2 Colour and olour spaes
A olour spae, is a mathematial way of representing olours as data, typially
as three or four values or olor omponents [34℄. Many dierent olour spaes are
dened, and the one to use for agiven task depends on the purpose of the system.
This is onveniently illustrated with the RGB olour model, arguably the most
familiar one to most omputer sientists. In traditional CRT sreens olour was
generated by energizing tiny dots of red green and blue oloured phosphorous with
aneletronbeam. Thesethreeoloursinturnstimulatethreeseparate typesofells
inthehumanretina,anddierentoloursanbegeneratedbymixingtheintensities
of these basi olours. For the purpose of displaying an image to sreen, an RGB
olour model may be the optimal one to enode the data, but for other tasks an
alternativeolourspae may bebettersuited. Forexample,inRGB,lightintensity
is not independent of hue. Rather, intensity is the mean value of all three olour
dimensions. Ifwe regard say a red objet ina brightly lit room, and then dim the
lights, the objet arguably has the same olour it only reet less light. Several
olourspaes are dened sothat intensity is independent of hrominane, as inthe
HSV/HSI olour spae. In HSI I stands for intensity, S represents saturation (or
how vivid the olour is), while H hue enodes the dominant wavelength. Colour
is important in fae loalization software, beause if it is available it an help us
onentrateour searhonskinoloured areas. Furthermore,HSI istheolourspae
most often used in fae reognition systems, as studies have shown that dierene
inthe pereived skin olour between individualsof dierentethniitylies mostly in
dierenein intensity rather than hrominane [19℄. It would of ourse be possible
tothreshold the skin olourregions inRBG spae, but the region onstituting skin
olourmayrepresentaomplexvolumeorunionofsuhin3Dspae. Ifwetransform
the olourspae in suh a way that intensity represents anindependentdimension,
wean threshold the 2D hue/saturation planeinstead. Also,even thoughthere are
eient algorithms for thresholding skin olour regions diretly from RBG olour
spae,empirialevaluationshave found alternativeones (HSIand YCbCr)superior
[28℄.
6
systemsthatusemorethanonebiometrifeature
to transform them to HSI olour spae. The following simple formulas transforms
RGBintotoHSIspae. Here{R,G,B}representthethreebandsintheinputimage,
and {I,S,V}in bandsin the transformed image.
I = R + B + G 3
S = max(R, G, B) − min(R, G, B) max(R, G, B)
H =
"
acos .5 × ((R − G) + (R − B))
((R − G) ∗ (R − G)) + ((R − B ) ∗ (G − B))
!#
× 360 2π
Many faeloalizationapproahes has skinolourthresholdingasaninitialstep
intheproessing,eitherdiretlytondfaes,ormoreindiretlytoreduethesearh
spae. This is useful, as the property is fast to evaluate, invariant of rotation and
saling,and is fairly insensitive tohanges in lightintensity.
2.3 Neural networks
Abiologialneural networkis aninteronneted group of neurons. The term arti-
ialneuralnetwork(ANN) isusedformathematialmodels thatmimisome ofthe
fundamentalharateristis of the way informationis represented and proessed in
biologialneuralnetworks.
Before going on to desribe in some detail the elements and harateristis of
ANNs, I'll briey review the history of their development. Although ognitive neu-
rosiene 7
isaveryativeeldofresearh,mostaspetsoftheenodingandproess-
ingof informationinbiologialneural systems isyetto be fullyunderstood. In the
1940's,the synapse, and the hemial transmissionin the synapse wasa reent dis-
overy,andgaverisetotherst modelsofhowthe physiologialpropertiesofneural
ells allowed them to proess information. Donald Hebb wasamong the pioneering
researhers in this eld, suggesting that learning ould result from strengthening
the synapses between synhronously ring neurons. Coneptually important was
alsothe work of Friedrih Hayek, and his hypothesis that order in the brain ould
arise out of deentralized networks of simple units. These ideas in turn gave rise
to the rst working models of neural proessing. In 1943, MCulloh and Pitts
published theirartiial neuron,illustrated in gure1. Fourteen years later, Frank
7
An interdisiplinarybranhofneurosieneinvolvedintheneuralmehanismsof ognition
x 1
x 2
x n
y = f(a) w
w w
1
2
n
Σ
a = x w i i
Figure 1: MCulloh-Pitts neurons
Input Hidden Output
Figure 2: Simple multilayer perep-
tron
RosenblattdevelopedaworkingneuralnetworkmodelonsistingofMCulloh-Pitts
neurons,whihisknownasapereptron. However, afterinitialoptimism,ANN'sfell
intosomethingofanaademi disreputewhenMarvinMinsky andSeymourPapert
proved thatpereptrons are linearlassiers unableto solvelassiation problems
that are not linearly separable, suh as the simple XOR problem. Furthermore,
Minsky and Papert argued that even networks with more layers would not be able
tosolvethe XORproblem[15℄. Notuntilthe 1980's did ANN omebak intofavor
inaademia, when their onjeture was proven false.
2.3.1 Neurons - The basi building bloks
Theneurons ornodes are the basielementsof neuralnetworks. The meannumber
of onnetions feeding into a genuine biologial neuron is around 10000, although
thenumberishighlydependentonthetypeofneuronand whereitisloatedinthe
nervoussystem[5℄. Liketheirbiologialounterparts, artiialneuronsgatherinput
from neighboring nodes and pass it on, and an as suh be regarded as funtions
frommany dimensions toone.
In the nervous system, the eet one ring neuron has on another depends on
several fators. Among themare how farfrom the ellbody the synapse isloated,
the amount of neurotransmitters 8
released when the neuron res, and the number
of reeptor moleules in the reeiving ell. The two lastfators an be modulated
through learning. In the same way, the eet of one artiial neuron on another
depends on the weight of the onnetion between them, as in biologial networks.
The ativation passed on from the rst neuron is simply its ativity multiplied by
8
Chemialtransmittinginformationarossthesynaptigapseparatingtwoneurons.
sumof allinomingativation. In pereptrons,the ativationof aneuronyisgiven
by
a ( y ) = K h w, x i + b
(1)
wherex isa vetor of ativationof the neurons feeding intoy, wis avetor oftheir
assoiated weights, b is a onstant (bias), and K is a (usually nonlinear) funtion
suh as the logisti or tanh. The neurons are individually simple, and basially
onlyrelayinformationomingfromonnetedneurons. Itisthenonlinearresponse-
funtion of the individual neurons that makes the behavior of the network as a
whole omplex and diultto analyze. The non-linearity of the response funtion
ensures that the ativation of the neuron will never exeed a maximum value, in
turn ensuring the stability of the system as a whole. If the information that is fed
intothe system isorrupt insome way, or if anindividual neurons stop working as
normal,the impat is minimized. This fault tolerane is one of the most attrative
aspets of both biologial and artiial neural systems. The response funtion we
seletedforourimplementationwasperhapsthemostlassi,thelogistiorsigmoid
funtion:
f(x) = 1
1 + e − x
(2)2.3.2 Network topology
Thenetworktopologyrefers tothelayoutofthenetwork,essentiallywhihnodesare
onneted, and in whih diretion information ows along these onnetions. The
topology of the network is one of the most important parameters determining to
whatdegreeit isable tosolve agiven problem. Itis ommonto dividethe network
intolevelsor layers,where nodesinone levelare mostlyorexlusively onnetedto
nodesinthe neighboring levels. Thelevelthatreeivesthevetortobeproessedis
referredto as the input layer, and informationpropagates the network terminating
atthe output layer. Between thesetwo,thereareusually oneormorehidden layers,
referredtoas suh beausetheir ativity isnot generallyvisible. It isommonthat
informationows onlyinone diretionthrough the network, whihis referredto as
the network being feed forward.
Itislearthatanalmostinnitenumberoftopologiesisavailabletoaresearher
forany given problem. How many layers should be used? What exat interonne-
Shouldwe use reurrent onnetions, onnetionswhere informationan owaway
fromthe output layer? Fortunately, neural networksare fairly robust inthat many
topologies an learn the required mappings in a given task. However, the dierent
topologiesmay varyin terms of speed oftraining, and generalizationfrom training
totest data. Though the possibilities are endless, very few guidelines exist to help
selet topologies.
So let us look briey on whih topologial parameters an be important to the
suess of ANN. General number of onnetions; If topology is too small (in
terms of units and onnetions), the network may not be able to learn the desired
input/outputmapping. On the otherhand,if itistoolarge, thenetwork very often
generalizes poorly to new input [4℄. All the weights need to be set by the training
data,and moreweightstranslates intoalargerrequired trainingset. Eet of the
number of hidden neurons; Usually the network will have fewer nodes in the
hidden layer than in the input layer. This fores the network to extrat features
from the input vetor, and the nal deision (output layer) is based on a small
numberof variables[40℄. As with the general level of onnetivity, a largernumber
ofhiddennodeswillallowthenetworktoadaptbettertothetrainingdata. However,
aswith ageneral inrease inonnetions, this adaptation may be over spei to
the trainingdata, again resultingin apoorer generalization.
Number of layers; As desribed above, a (two layer) pereptron is simply a
linear lassier, meaning it an only solve a lassiation problem when a linear
funtion an separate the lasses in feature spae. Formany problems of any om-
plexity,thisisnotenoughandwemustresorttomorelayers. Alsointermsofhidden
layers oneshould limitoneself to asfewas possible,asgeneralizabilityand training
speed willbeaeted. However, oasionallymore than twolayers may be thebest
solution. This is typially the ase when very similar input vetors must result in
very dierent outputresponses. Withfourlayers, the rst hiddenlayeran beused
toreode the data beforepassing it on tothe seond hidden layer, whih performs
the required mapping onto the output nodes. Full vs loal onnetivity; In a
fully onneted network,eah node in alayer isonneted to every one in the layer
aboveit, and this is oftenthe default hoie. One popular optionis tostart with a
fully onneted network, and prune away onnetions by testing to see if network
performane deteriorates if the weights are removed. Empirial evaluations have
determinedthat loalonnetivity performs better at fae detetion tasks [9℄. Re-
urrent onnetions; Reurrent onnetions are of ourse the norm in biologial
lassesoftaskstrulyrequiretheseonnetions,andalsobeausealgorithmstraining
these networks are far less established. The weights of the onnetions distributed
through the network an be viewed as its long term memory, adjusted to learn a
given task from a training set. However, typially the network has no short term
memory,one the informationhas propagatedthe network it annotrememberthe
patternitwaslastexposed toorthe orderin whihtwopatterns were presented. If
thenetworkistohavesuhashorttermmemory,itmusthavereurrentonnetions.
2.3.3 Network training
Bynetwork trainingwe usually mean analgorithmfor adjustingthe weights of the
onnetions in order to minimize the output error relative to some riterion. In
supervised training, the network is presented with a training vetor for the input
layer, as well as a vetor ontaining the orret responses for eah training vetor.
The total error given a set of weights is usually formalized as the sum of squared
deviationsof the output nodes from the orretresponse.
E = 1 2
X
o
(t o − y o ) 2
(3)The development of the bak propagation algorithmfor trainingmultilayer per-
eptrons was a major reason for the revival of interest in ANNs. Although rst
proposed by Paul Werbos in the 1970', it was after the lassial artile in 1986,
Rummelhartet. al. in Nature, thatthe bak propagation algorithmbeamewidely
popular [24℄. Eetively, bakpropagation is a way to train multilayer pereptrons
usinggradient desent. Thealgorithmsderivesitsnamefromthefatthat theerror
propagatesthe networkfromthe outputlayertothe input layer, beforethe weights
are hanged. Gradient desent is an optimization algorithm that approahes a lo-
al minimum by adjusting the weights so that we move in the steepest downward
diretionalongthe error insmallsteps.
∆w ij = −ǫ[ δE
δw ij ]
(4)Thesizeof thesteps totakeisreferredtoasthe `learningrate
δ
. Ideallywe wanttoendup with network weightsthat leavesus atthelowest pointonthe errorsurfae.
Ahighlearningrate willusually meanthe trainingonverges faster,but if theerror
surfae is very eentri, taking large steps ould ignore important loal gradient
follows the gradient,itis very likelytoget stuk inapoorminimum, andto redue
the risk of this, a momentum is usually added to the movement along the error
surfae. As the name implies, this parameter is intended to give the algorithm
enoughmomentum to get out of smalldips in the error surfae.
In order to have a starting position on the error surfae, the network weights
are initially seeded with small random values. From there, training proeeds iter-
atively, either until the algorithmonverges{we are at a minimum, and there is no
further improvement, the error drops below a required value, or a maximum num-
ber of training iterations have been run. A host of other trainig algorithms have
been developed, suh as the rprop andquikprop, but bakpropagation remains the
workhorse of neuralnetwork training.
The error surfae is usually very eentri, and in genuine networks of any size,
we never end up at the global minimum. However, this is not neessarily suh a
bad thing. The network may still solve the problem admiraly and generalize well.
Furthermore, beause they usually never end up with the same nal weights even
whentrainedwiththesamedata,dierentnetworkswiththesametopologyrespond
dierentlytonewinput. This anbeexploitedbyombiningtheresponseof several
networks, usually referredtoas networkvoting. In otherwords, even if one ofthem
makesan error, the others an ompensate.
2.4 Geneti algorithms
A algorithm an be dened as expliit step-by-step proedure forproduing asolu-
tion to a given problem. Aording to this denition, and the way it is ommonly
usedinomputersiene, algorithmsshouldguarantee thatthe desiredendstatebe
ahievediftheproedureisfollowed. Dierentalgorithmsanvaryintheireieny
insolvingalass of problems,and an beompared byhow theomputer resoures
requiredby the algorithm,suh asthe numberof instrutionsortemporarystorage
spae,sales as the size of the inputinreases. This isformalized inO-notation
f(x) =
O u(x)
as x
→
a,provided that|f (x)| ≤ K |u(x)|
However, ertain lassesof algorithmssuh asprobabilisti algorithms,may not
stritly fall under the denition of algorithmsgiven above. In a probabilisti algo-
rithm, the probability that the result willdeviate from the desired end state when
thenumberofiterationsinreases an bequantied, andwilloftenonvergeto0 as
the number of iterationsinreases to innity.
ahieved,nor thatthe distanefromsuh astate anbequantied. Instead, geneti
algorithms represent a general way of moving towards better solutions in ertain
omplex optimization senarios. As the name implies, it borrows stylized elements
andoperatorsfromevolutionarybiology,andinessene,asimulatedmiroevolution
isrun on the problemdomain.
For most real world problems of any size, a omplete evaluation of the searh
spaewithanysigniantdegreeofresolution isprohibitive,and oftentherearefew
reasoned hoies as to how to move towards an optimal solution. Stritly optimal
solutions,astheglobalminimumonsomeerrorfuntion,maynotevenbeneessary.
What may be desirableis anon-arbitrarily improved solution.
Genetialgorithmsmay bepartiularlywell suited for omplexproblems where
heuristiormoreformallydened methodsof optimizationare not available, where
the tness funtion is so omplex that traditional methods frequently end up at
loal minima, the searh spae is large, omplex or poorly understood, or domain
knowledgeis sare orexpert knowledge isdiult toenode [4℄.
Beforeagenetialgorithmanbeimplementedforagivenproblem,twoelements
must be in plae. These are i) a way of representing a solution in suh a way that
itan bemanipulated by the algorithm,and ii)a way of determiningthe tness of
agiven solution.
2.4.1 Elements and operators in geneti algorithms
Genes; representing solutions. In GA a solution to a problem an be thought
ofas anindividual ina populationof solutions. The solutionstothe problemmust
beoded ina way that an beated uponby the evolutionary operators. As thisis
roughly analogous to the way DNA ontains the biologialblueprint of anindivid-
ual organism, the enoding of a solution is often referred to as a hromosome. In
DNA,tripletsof 4dierentkinds ofbases odefor 21distintamino-aids. Though
omputationalanalogues of suh omplex enoding shemes are possible and have
been implemented, usually asimpler enoding sheme is used, suh as a bit string.
The important pointis that this hromosomeodes for one unique solution.
Fitness funtion; evaluating solutions. As evolution is based on the sur-
vival of the ttest, we need some way of determining how t our solutions are,
andhopefully movetoward betterones during evolution. More speially weneed
a tness funtion. This funtion takes a hromosome, deodes it into a working
M M
Figure 3: Bit mutation
c1 c2
Figure 4: Chromosomal rossover
individuals, and evaluates it's tness. What onstitutes the tness of a solution
must be dened for eah problem or lass of problems, and how good the tness
funtion is will largely determine the suess of an evolutionary approah. Unless
the tness funtion is handled properly, GAs have a tendeny to onverge towards
aloalratherthan the global optimum of the problem.
Geneti operators; modifying solutions. Given than we now have ways of
enoding individual solutions, and an determine their tness, what we need are
evolutionary operators ating upon the individuals. Two main operators are used,
mutation and rossover,and their relative importaneis widely debated. Both are
illustratedin gure 2.4.1.
Mutation is perhaps the most widely known evolutionary operator. It basially
refers to small random alterations to the hromosomes. An important parameter
ofevolutionaryalgorithmsisthe rate of mutation,the balane between hange and
stability. If the rate is too low this may ause geneti drift, or onvergene to
loaloptima. Alternatively,ifthe mutationrate istoohigh,itmay leadtounstable
hromosomal behavior and jumping around in searh spae. However, exatly
whatonstitutesagoodmutationrateis hardlyever disussed,and forallpratial
purposes must be hosen based ontrialruns.
The seond evolutionary operator is that of rossover. The ells that fuse in
oneption eah have a single version of the 23 unique hromosomes in the human
genome. Thisfusionresultsinadiploidorganism,onewithtwosetsofeahhromo-
some. During meiosis 9
, when the sex-ells are generated, the pairs of hromosomes
separate, one migrating to eah of the two new ells. However, in the proess, the
9
A elldivision thatgenerateshaploid sex-ells,i.e. ellswithonlyonesetofhromosomes.
realignment of parts of the DNA threads. A similar proess is often implemented
ingenetialgorithms. When twoorganismsare seleted for breeding, theremay be
one or more points of rossover, where large segments of their DNA is exhanged.
Crossoverthusworksonamuhhigherlevelthanmutation,exhanginglargehunks
of DNA, and therefore alsolarger number of phenotypitraits.
Up to now, a fairly stylized but biologially reasonable set of operators have
been implemented. However, as the rates of mutation and rossover are hard to
alibrate, and the gene pool tiny ompared to that in say, a small drop of water,
youriskloosinggoodsolutionsfromthepool. Toensurethatgoodsolutionsarenot
lost,aproess referredtoaselitistseletionorelitism isused. Thismerely refers to
bypassing the regular evolutionary proesses, and simply opying a handful of the
best solutionsinonegenerationintothe next. Thiswayelitismanrapidly inrease
the performane of geneti algorithms,as it prevents the lossof goodsolutions.
2.4.2 Running a geneti algorithm
We are now in a position to put all the elements together into a working geneti
algorithm. First, a pool of (often random) hromosomes is initialized. A random
poolensures thatinitialsolutionsaredistributed overmuhof theparameterspae.
To ensure that large parts of the parameter spae are searhed, the pool should
ideally onsist of a large number of hromosomes, from hundreds to thousands.
The exat number is of ourse up to the individual researher to determine, and
will depend heavily on the time taken to evaluate tness. For eah generation,
the hromosomesundergo some degreeof mutation. Pairsofhromosomesare then
seletedforbreeding,theprobabilityofseletionbeingsomefuntionoftheirtness.
Mating results in a new pool of hromosomes, with hopefully better mean tness,
andthe proessan repeat itself. The algorithman terminateafteraxednumber
of generations or when a distane from a riteria is reahed, while no onvergene
tosuh aminimum distane is guaranteed.
2.4.3 Geneti algorithms for neural networks
With the number of synapses in the human brain estimated at 10E14 [37℄, and
the number of genes in our genome urrently thought to lie between 20-30 000, it
is unlikely that evolultion has shaped the topology of this network at the level of
individual synapses. However, on a slightly higher sale, the brain is the result of
neural iruits, it is not far fethed to try to shape them with the same kinds of
evolutionary proesses. In fat, applying evollutionary algorithms to various har-
ateristis of ANNs is not new, but has emerged as an established methodof both
seletingnetworktopologyandtrainingnetworks[4℄. Untilreently,seletingnetwork
topologywas done troughhuman expert knowledge, but this may hange. Shaer
et.al. have presented results showing that the ANN designed by evolutionary ap-
proah had better generalizationability than one trained by bakpropagation on a
humandesignedarhiteture[4℄. Furthermore,whilethebakpropagationalgorithm
isby farthe mostwidely used methodof trainingnetwork,itmay not alwaysbean
option. Trainingthroughgradientdesentapproahes requirestheerrorbedieren-
tiable,while evolutionary algorithmsan train networksregardless ofwhether they
are feedforward or reurrent, and even if error is disontinuous [4℄. Evolutioinary
trainingis also less likely than graidient desent based methodsof getting stuk in
aloalminimum,beauseitsearhes overseveral regionsinparallel. Unfortunately,
forthe same reasonit has diulties in ne tuning the parameters.
3.1 An overview of the algorithms
Though faedetetion has reeived a lotof attention the past 10years, ithas been
a eld of researh for many more. With the development of better algorithmsand
more powerful omputer hardware, the eld has moved from searhing for a single
vertial frontal fae in visually simple images with homogeneous bakground, to
multiple faes in diult lighting onditions, at various angles and distanes from
theamera,situatedinomplex visualsenes[19℄. Still,even today the majorityof
algorithmspublished are onlyable to detetvertial ornearly vertial faes.
Asoutlinedinthe introdutionof thisthesis, the ability toquikly androbustly
loalize faesin images is valuable innumerous areas, ranging tosurveillane, on-
tent based indexing and human mahine interation. For this reason there are a
multitudeof published algorithmsinthis area, eahone trying tolaim superiority
onrobustness or speed.
In the myriad of approahes, it is useful to distinguish two major ategories
[39℄[19℄. Oneassumesanapriorifaemodel(model-based/topdown),andattempts
to loalize/validate by means of an expliit interrelation of faial features (suh as
eyes, mouthand nose), while the other approah assumesno priormodel.
3.1.1 Feature-based/knowledge-based algorithms
Basially,infeature-basedmethodsforfaeloalization,weattempttousewhatwe
alreadyknow aboutfaes toonstruta modelwhihan be omparedtothe data.
When presented with a new piture,a feature based algorithmusually begins with
alow level analysis,where propertiessuh asedges, olour, motionand texture are
extrated [19℄. The onstellationof these features are then ompared tothe model.
Deformable templates,templates that tan apriori elasti modelto faialfeatures
are an example of a more sophistiated feature based approah. A problem with
several of these methods, is that the deformable templates must be initialized in
the viinity of the fae. Good fae andidates an be found by looking at ellipti
skin-olourregions.
3.1.2 Image based algorithms
The dening haraterizing of image based approahes is that we do not diretly
enode our preexisting knowledge about faialfeatures intothe algorithm. Rather,
Figure 6: Thresh-
olded
Figure7: Clustured Figure8: Cleaned
Figure 9: Highpass-
ltered
thesystem learns todisriminatefaes fromnon-faesthrough analyzinga training
set. The images are is mathematially ompared with the stored representation
withoutexpliitlyextratingfaialfeatures. Popularmethodshereinludeprinipal
omponentbasedalgorithms, neuralnetworks,supportvetor mahinesand hidden
markov models [19℄. Prinipalomponent basedmethodsare presented indepth in
setion4.1.2, while aneural network based approahis desibed in3.4.
3.2 Illustrating hallenges with a model-based algorithm
Implementing theneural networkbased algorithmpresented inthis thesis was ala-
borious proess, undertaken only afteran initialmodel-based approah proved not
to be suiently robust. Model-based approahes are often fast to implement and
exeute,are easytounderstand,andhaveanintuitiveappeal. Theydohoweverrely
onthe assumed faes onforming fairly stritly to the model, and small deviations
from ideal onditions will often result in failure. In this setion I will sketh the
model-based fae loalizationalgorithmwe abandoned, and give examples of some
of the many ways it an fail. We hope this will illustrate some of the diulties
involved in automati fae loalization. The model-based attempt is loosely based
onthe algorithm of Hesieh et. al.[12℄. The steps of this algorithm are, skin olour
thresholding,skinolourlustering,featureextration,templatemathing,andthese
1. Skin-olour thresholding.
Thesimplestonditionforfaeloalizationalgorithmsisasinglefrontalfaeagainst
adistintlynon skin-oloured bakground, suhasa typialpassport photo. Under
theseirumstanes,skinolourthresholdinganbealmostsuienttoloalizethe
fae. If an area is skin oloured, has roughly oval shape and ontains some darker
internal regions, we may lassify this as a fae. Figure 6 shows the result of skin
olourthresholding of gure 5.
Figure10: Clustered 2d histogram 2. Skin olour lustering.
In natural settings, skin olour thresh-
olding willhardly ever be suient for
loalizationbeausevariationsinbak-
groundhues willmake faesdiultto
segment based on olour alone. Often
two or more faes will be merged in a
single skin oloured region, as an be
seen ingure 6. To determine whether
an image area ontains several faes,
some sort of lustering is typially performed. In our initial approah, we devel-
oped a lustering algorithm for skin olour regions, the details of whih are given
in appendix A. Essentially the lustering approah involved identifying the peaks
of the 2D hue/saturation histogram 10
. After merging lose peaks, the histogramis
divided into regions aording to the eulidean distane to the enter of mass the
losesthistogrampeak. Clustering of the thresholded2D hue/saturationhistogram
an be seen in gure 10. Figure 7 illustratesthe eet of the lustering algorithm,
and gure 8 after some leaning. We an see that in this ase the lustering algo-
rithmhas suessfully separated the dierent faes, and atmost one fae ispresent
ineahskinolourregion. Formoreexamplesofskinolourlustering,seeappendix
A.Whilewe were pleasedwiththe performane ofthe lustering algorithm 11
, oa-
sionallyitwasunabletoseparate afaefromthe bakgroundorfromanother. This
wastypiallythease ifthe bakgroundolourwassimilartothatofthe fae. As it
onlysearhed for a single faein eah skin oloured region, one oroften both were
missedif the regionontained more than one fae.
10
peakswere denedas>0.9standarddeviationsabovethemean.
11
Seewebsiteformoreexamples
3. Feature extration.
Hopefully atthis stage, the image isbroken down into aset of regionsrepresenting
dierent objets or people, and the next stage is to determine whether any one of
theserepresentsorontainsafae. This isdoneby rstextratinglowlevelfeatures
fromthe regions. Eyes, nose andmouthareoftendarkerthanthesurroundingarea,
soaommonstep is tosearh for pathes ofpixels withlowerintensitiesthan their
surroundings[12℄. We found this approah tobe too unreliable, and easilyaeted
by shadows and normal variation in light intensity. Rather than look for darker
areas, we found high-pass ltering to be more robust. However, it beomes a hal-
lengeto detetfaes atvarioussales, beausehighpass lteringof larger faeswill
leave very dierent signatures than small ones. Dierent high-pass lters may be
best suitedto enhane thefaial featuresinlarge vs. smallfaes. Dierentfeatures
ould be used for skin-olourregions of dierent sizes, but the size of this regionis
not a perfet indiator of the size of the fae within. Although better than mere
intensity,thehighpasslteringwasalsosensitivetoedgesausedbyglasses,beards,
sharp shadows et.
4. Template mathing. One the features are extrated, one must determine
whether the onstellation of features found represented afae. Hsieh et. al. used a
templateoflightintensity roughlyrepresenting what one would expet given aver-
tialfae. i.e., two darkerobjetsinthe leftand rightupperquadrants,along dark
path in the lower middle et. Our approah was to t a triangle to the enter of
massof themost prominenthighpass-lteredobjetsintheinterior 12
of the region,
12
Searhrestritedtotheinteriorregiontoavoidhighpasslteringthebordertothebakground.
positive hits from our implemented system.
as it was assumed that the most prominent objets would represent the two eyes
and the mouth. If suh a triangle t well 13
, then it was delared a hit. Searhing
fortriangles representing faialfeatures is awidely used strategy [17℄. Usually itis
assumedthat thefaesarevertialornearlyvertial,andifthis isnotthe asethen
they will be missed. Approahes that require allfeatures to be found, will alsonot
detetfaes whereone of these are oluded.
Figure11illustratesthe suess of themodel-based loalizationalgorithmonan
easy image. We an see how the eyesand mouth are the most prominent objets
after highpass ltering, and the triangle ts perfetly. On the other hand, gure
12 shows examples of images in whih the model-based approah would result in
detetion failure. Fators that make these images diult is the lak of olour,
diultlighting onditions, rotation and olusion,and the vast dierenein sale
among the faes. One might argue that a fae reognition system does not need
to be robust under diult onditions. This is beause the urrent reognition
algorithmsare all so sensitive to the the same variations in lighting and olusion
thatarelikelytothrowoamodel-basedloalizationalgorithm. However,itwasour
experienethatoftenthealgorithmwouldfailevenunderseeminglygoodonditions.
Furthermore,even whenafaewasorretlydeteted,thisdidnot neessarilymean
thatthealgorithmwasabletoboundthefaesuientlywellforreognition. Often
itonfusedamouthwith the shadow underthe nose,or theeyeswith theeyebrows
or rims of the glasses, resulting in a bounding box overing either too littleof the
fae, or too muh of the bakground. For an example of loalization failure, see
appendix A.
13
thelengthofallsideswereroughlyequal
100 200 300 400 500 600 700 10
20
30
40
50
60
−30 0 −20 −10 0 10 20 30
500 1000 1500 2000 2500
100 200 300 400 500 600 700
10
20
30
40
50
60
−30 0 −20 −10 0 10 20 30
500 1000 1500 2000 2500 3000
Figure13: Horizontallineonguretotherightindiatereognitionthreshold. Trans-
lationand saling of even smallamounts leads to reognition failure.
3.3 The importane of aurate loalization for reognition
Fae reognitionalgorithmsare oftentested onvariationsinlighting, pose or faial
expression relativeto the storedtemplate, inan eorttodetermine howrobust the
algorithmis. However, rarely doauthorsexplore howsensitive thesealgorithmsare
tosub-optimal segmentation of the fae area. In other words, howaurately must
the oordinatesof aboundingbox aroundafaebeinorder tobeable toreognize
the individual, and how does performane degrade as a funtion of dierent kinds
of sub-optimal loalization. As we had deided to use the eigenfae approah 14
for
the reognition we rst wanted to get animpression of how sensitive it was to the
aurayof the loalization.
A prototypeof aeigenfae based reognition system was implemented inMAT-
LAB.Wethenseleted 31imagesfromtheStirlingfaedatabase [29℄,andmanually
found the oordinates of the faes. Reognition rates were evaluated as a funtion
of systemati variations in these oordinates 15
. The kinds of variations performed
inluded horizontal, vertial and diagonal translation, rotation and saling. Figure
13illustratestheapproah,andthegraphtotheleftshowshowhangeinpositionof
theboundingboxinreased thedistane fromthe storedtemplate 16
. The prototype
indiated that in the images we used, and at asale of
64 × 64
pixels, a vertial orhorizontal translation of more than 4 pixels usually resulted in a mislassiation,
as did rotation > 6 degrees. The exat numbers are not important, but our trials
14
Thedetails oftheeigenfaebasedapproaharegiveninsetion4.2
15
Faesweredownsaledto64
×
64pixelsbeforereognition16
Themaximumdistaneforreognitionwassetto1.4E3,basedoninitialtrials
and the boundingbox does not have to be muh ofor the algorithm tofail.
A neural network based approah willusually indiatehits inmultiple adjaent
loations. Given the results above we therefore propose not to searh overlapping
fae hits for the single best math, but rather attempt to reognize eah box indi-
vidually. Thismay inreasereognitionrates, ashanes arethatatleastone ofthe
boundingboxes suientlywellsegmentsthe faeforthe individualtobelassied
orretly. On the other hand, false positives or poorly segmented faes will not be
reognized,and willbedisarded duringreognition.
3.4 Rowley's neural network-based approah
Figure 14: Topology of Rowley's fae de-
tetion network Although a multitude of neural net-
work based approahes to fae loal-
ization have been published, Rowley,
Baluja and Kanade are usually red-
ited with the pioneering algorithm. A
neural network-based upright frontal
fae detetion system was published
in 1998 [22℄, and later the same year
it was extended to handle faes ro-
tated in the image plane [23℄. Subse-
quentapproahesusuallyborrowoneor
more fundamentalelements of their al-
gorithm,themajorones being animage pyramid, arouternetwork and adetetion
network.
The neural networks reeive their input from a 20
×
20sliding window enteredaround every pixel in the image. A 20
×
20 sliding window has been popular alsoinvetor mahine and likelihood-basedapproahes tofaeloalization[25℄. As this
window is of xed size, the only way to detet faes at dierent sales is to san
the image at inreasingly lower resolutions. Therefore, given an input image, an
image pyramid is rst onstruted. In Rowley's implementation, this pyramid is
onstrutbysub-samplingtheoriginalimage. Ateveryloation,theslidingwindow
is histogram equalized before it feeds into the router network, whih is trained to
estimate the amount of lokwise rotation of an area. The histogram equalization
is performed to ompensates for dierenes in amera input gains, and improves
Theoretially, rather than using a router network, eah loation ould be sub-
jetedto35detetionnetworks,eahtrainedonaspeifaeorientation. However,
theneuralnetworksare fairlyomputationallydemanding,andthisapproahwould
requireamultipleof35sans omparedto therouter approah. Forthe routernet-
work, Rowley uses a fully onneted three-layer pereptron, with 400 input nodes,
15hidden nodes, and 35 output nodes. The network is trained on images of n
×
10degrees rotation. The most ative output node represents the estimated amount
oflokwise rotation. The window an then berotated antilokwise the estimated
numberofdegrees. Abivariatelinearregressionisusedtoreduetheimpatofdire-
tionallighting. Followingthelighting orretion,thewindowishistogramequalized
one again, and presented to the detetion network. As the name implies the task
of the detetor network is to lassify a given window as a fae or non-fae. The
topologyof the detetor network is skethed in gure 14.
Itis learthat this approahisveryomputationallyintensive. A800x600 pixel
image will give rise to more than 750 000
20 × 20
pixel windows to san. Eahwindow must be equalized, rotation estimated, rotated bak, tted to a bivariate
linearregression, equalized again and run through several detetor networks. Row-
ley'srouter network onsists of 6530 network edges, so a single rotationestimation
will quikly involve more than 10000 oating point operations. As well as being
omputationallyintensive,thevastamountsofwindowssanned, usuallyonlyafew
ontainafae. Thelargenumberoftrialsandthelowbaserateanquiklytranslate
intoalotoffalsepositives. Toounteratthis, Rowleyusesthe followingstrategies;
anarbitration sheme, bootstrap training and network voting.
Their arbitration relies on simple heuristis, suh as disallowing overlapping
faes, and requiring multiple hits in the same image region. Bootstrapping in this
ontext refers to running the algorithm on images without faes, and adding the
falsepositivesto the training set. Another method of reduing the number of false
positivesis using several detetion networks, eah trained withdierentinitialran-
dom seeds to validate the presene of a fae. Howthe deisionsof these individual
lassiers in turn are treated has been the subjet of some investigationin the lit-
erature(see ex. Hjelmås [11℄). The network voting involves AND'ing the response
of two networks trainedondierent initialrandom values.
Our approahuses the main onstituent part of Rowley's algorithm,while we have
attemptedto reduerunning time. This isprimarily done through:
i) using a skin olour and image texture model to substantially redue the searh
spae.
ii) Usingan evolutionarystrategy to reduethe numberof edges inthe routernet-
work.
3.5.1 Preproessing: Colour transformationand onstruting the image
pyramid
Our program aepts unompressed TIFF [1℄ images as input. When the program
is run, the image is rst heked for olour information. If the input image is in
RGB format, itis transformed intoan HSI olour spae. The details of this olour
transformationsare given in setion2.2. The resultingimage has three bands. One
onsisting of hue (H), one of saturation (S) and the last of pixel intensities (I). In
ourimplementation,theolourinformationissolelyusedtodisardnon-skin-olour
regions. The detetion step itself is olour blind, and operates exlusively on the I
bandof theimage. WhereasRowleyusedasub-samplingapproah inonstruting
the image pyramid, we eventually settled on bilinear interpolation in downsaling
theimage. Asub-samplingapproah resulted inlargedeteriorationonhigherlevels
inthepyramidandpoorerdetetorperformane,whiletheextraproessingrequired
for the interpolation is minusule ompared tothe other omputationallyintensive
steps inthe algorithm.
3.5.2 Optimizing the searh by using olour and texture
The main limitation of a neural network based approah is that it is very ompu-
tationally demanding. Rowley's 1998 implementation used 383 seonds to san a
320
×
240 pixel image for vertial faes17. With the more densely onneted routernetwork, and additional operations of rotation and equalization, the time for de-
teting rotated faes will easily be a multiple of that for vertial. In fat, it will
beshown insetion 6that ompensatingfor faerotationinreases proessingtime
in our implementation by up to 500%. If the algorithm is to be genuinely useful
intraversing large databases, the runningtime must be redued substantially. The
17
Benhmarksperformedona200MHzR4400SGIIndigo2system
arding areas of the image surfae where the presene of a fae is highly unlikely.
Withalmost all running time of the algorithmtied up in the neural networks, dis-
arding a perentage of the image results in an almost linear redution in running
time. As an beseen ingure 3.5.2, often substantialregions of the original image
an be ignored if the image is thresholded on texture and olour. While several
authors have suggested using skin olour to redue the searh spae, we have not
seen a similar argument presented for texture, although numerous papers present
texturebased methodsfor loalizing faes[19℄.
Empirialstudies havedemonstrated thatskin olour of people of dierent eth-
niity varymostly inlightintensity,andrelativelylittleinhromatiquality (under
normal lighting onditions). Skin olours were dened as an area of the HS plane,
withhue valuesrangingbetween <0,130>, saturationvaluesbetween<0,70>,(and
anintensity level >40).
Colour an be used in several ways, either by disarding non-skin olour areas,
or by lowering the detetion threshold in these areas. In our approah, we simply
disardthose areas of the image that fall outsidethe thresholds.
Ifanimageontainsasingle largefae,itwillstillbesanned onmany dierent
resolutions,mostof whih aremany levelsbelowthe one onwhihitisdeteted. In
fat, by far the most windows willbe sanned in the lowest levels of the pyramid.
At these bottom levels
20 × 20
windows entered over an area of the fae willontainskin-olouredpixels, buthaveatexturethatlets usdisardit. Furthermore,
in graysale images, whih is still the norm in most available fae databases, we
obviously annot use skin olour. At the
20 × 20
pixel levels at whih faes aredeteted, they have textures that set them apart from most non-faes, so prior to
detetiontheimageanbethresholdedontexture. Inseletingtexturestoalulate,
wemust of ourse hoose ones that are able to disriminatewell between faes and
non-faes. However, as up to a million windows must be evaluated for eah image,
goodtexture andidatesmust alsobefasttoalulate. Basedempirialevaluations
weseleted variane and entropy.
Varianeis here dened as:
1 n − 1
25
X
i =1 25
X
i =1
X i − X ¯ ) 2
(5)This property is used beause under normal lighting onditions faes have areas
with lower light intensities (eyes, mouth, shadow under nose et.), and better lit
Figure 16: Texture
mask
Figure17: Colour mask
Figure 18: Combined
mask