Implementing an automatic face recognition system

(1)

UNIVERSITETET I OSLO Institutt for informatikk

Implementing an automatic face recognition system

Hovedoppgave

Nikolai Czajkowski

1. august 2006

(2)

The goal of this projet was toimplement and evaluate anautomati fae reogni-

tion system. Fae loalization was adapted from the algorithm of Rowley et.al. A

20 × 20

^pixel ^sliding ^window ^sans â ^pyramid ôf ^the înput îmage, ^feeding înto^sev-

eral neural networks. The router network estimates rotation, while three detetor

networks lassify the window as a fae or non-fae. This approah is very ompu-

tationally intensive, and estimating rotation is the most demanding aspet. Our

ontributiontothe loalizationalgorithmis toreduethis proessingtime by i)us-

ingagenetialgorithmto evolve a moresparsely onneted routernetwork, and ii)

tolimitthesearhspaebythresholdingtheinputimageonskinolourandtexture.

An eigenfae based algorithmwas used for fae reognition. Our ontributionwas

to have the system automatially extrat multiple and dierently segmented faes

fromthe trainingimage, rather than manually segmenting a single fae.

Evaluationoftheloalizationalgorithmdemonstratedthatitisgenerallyrobust,

abletoorretlyandauratelysegmentfaesdespitevariationsinlighting,rotation,

sale,faialexpression andolusion. Furthermore,althoughthedierenebetween

networktopologieswassmall,theevolvedroutertopologyimprovedspeedbyroughly

13%, while improving the loalization rate ompared to Rowleys fully onneted

topologyby3%. Texture thresholdingredued therunningtimeofthe algorithmby

almosttwothirdsinbotholourandgraysaleimages,whileskinolourthresholding

independently redued the running time by 45%. In olour images the ombined

eet was aredution in proessingtime of 75%.

Benhmarking of the reognition algorithm demonstrated that eigenfae-based

reognitionisverysensitivetovariationsinfaesegmentation relativetothestored

template. However, beause automati segmentation is likely to vary somewhat

omparedto amanualsegmentation of the same fae,performane of the eigenfae

based approah an be improved by automatially extrating and storing several

slightlydierently segmented templates per person.

TheprogramwasimplementedinC,andnewlibrariesforneuralnetworks,image

proessing,TIFF reading/writing and geneti algorithmswere written tomake the

system fast and exible.

(3)

I would like to thank my supervisor Anne Solberg for a patiene that has now

spanned a onsiderablenumberof semesters, and for muh helpful advie.

Iwould alsoliketothank myfriends, family,and girlfriend. I'mlookingforward

toseeing youallagain.

Oslo, July 2006

Nikolai Czajkowski

(4)

Aknowledgments . . . 3

1 Introdution 6 1.1 Biometris . . . 7

2 Introdution to the onstituent parts of the algorithm 11 2.1 Fae pereption, brain and biology; what an be learned, and what should be used? . . . 11

2.2 Colour and olour spaes . . . 14

2.3 Neural networks . . . 15

2.3.1 Neurons - The basi buildingbloks . . . 16

2.3.2 Network topology . . . 17

2.3.3 Network training . . . 19

2.4 Geneti algorithms . . . 20

2.4.1 Elementsand operatorsin genetialgorithms . . . 21

2.4.2 Running ageneti algorithm . . . 23

2.4.3 Geneti algorithmsfor neuralnetworks . . . 23

3 Fae loalization 25 3.1 An overview of the algorithms . . . 25

3.1.1 Feature-based/knowledge-based algorithms . . . 25

3.1.2 Image based algorithms . . . 25

3.2 Illustratinghallenges with a model-based algorithm. . . 26

3.3 The importane of aurate loalizationfor reognition . . . 30

3.4 Rowley's neuralnetwork-based approah . . . 31

3.5 Our fae loalizationapproah . . . 33

3.5.1 Preproessing: Colour transformation and onstruting the image pyramid . . . 33

3.5.2 Optimizing the searh by using olour and texture . . . 33

3.5.3 Trainingthe networks . . . 36

3.5.4 Usinganevolutionaryalgorithmtoimprovethenetworktopol- ogy . . . 38

3.5.5 Reduing falsepositives . . . 41

3.5.6 Putting it alltogether: Loalizing afae . . . 42

3.5.7 Other possible optimizations . . . 44

(5)

4.1 An overview of the algorithms . . . 46

4.1.1 Feature based methods . . . 46

4.1.2 Image based methods . . . 47

4.1.3 Reognition by eigenfaes . . . 47

5 Implementation details 49 5.1 TIFF reading/writing library . . . 49

5.2 Image proessing library . . . 50

5.3 Neural network library . . . 50

5.4 Geneti algorithmlibrary. . . 52

6 Benhmarking 53 6.1 Fae loalization . . . 53

6.1.1 Benhmarking set . . . 53

6.1.2 Soring hits and misses . . . 54

6.1.3 Trainingdata, algorithmparameters and test system . . . 55

6.1.4 Benhmarking results . . . 56

6.1.5 Disussion and omparison to othersystems . . . 60

6.2 Fae reognition . . . 61

6.2.1 Benhmarking set . . . 61

6.2.2 System setup . . . 62

6.2.3 Benhmarking results . . . 62

6.3 Comparisons toother systems . . . 64

7 Conlusions and personal omments 65 7.1 On amore personal note . . . 66

7.2 Conlusion . . . 68

8 Appendix: HS lustering algorithm 68

9 Appendix: Sample images from model-based loalization 70

10 Appendix: Sample images from fae loalization benhmarking 70

(6)

In 2001 the ity of Tampa, Florida hosted the annual Amerian Football league

nal,the SuperBowl. That year, in additiontothe regularmediafrenzy,the nals

reeived anextradose ofmediaattention asityoialshadinstalledanautomati

faereognitionsystem,FaeIt,inthestadiumentrane. Camerasreordedpeople

entering, FaeIt loalized their faes in the video stream and ompared them to

images ina polie database. The initiative slingshot biometrisystems into publi

awareness, and earnedtheitythe Worst publioial awardfromPrivay Inter-

national. Similarsystems have sinebeen installedinahandful ofloations,but so

farthey havenot ledtoany arrests. Thelak of arrestshighlightsthe sobering fat

thatdespitethereentupsurgeinpubliimaginationandavailabilityof ommerial

faereognitionproduts,thistehnologyhas yettomature,and robustreognition

remainsa hallenge.

Perhapsheremorethaninanyotherareaofmahinevision,therelativeeasewith

whihpeopleperformthetaskomparedtoomputersbeomesapparent. A frontal

faein your eld of vision potentially indiatesthat anindividual has its attention

diretedatyou, andthis ouldbeanimportantevent. Also,formoresoialanimals

likeus,itisvitaltobeabletoreliablydistinguishbetweenthemembersofourgroup.

Asvisionisourprimarysense,itisnaturalthatthisdisriminationisbasedonvisual

pereption. Faes are suiently omplex to ontain the neessary information

to distinguish individuals, and even ontain informationabout the person's mood.

As fae reognition is so important for us, many have speulated that speialized

neural strutures have evolved to failitate this task. The degree to whih people

are endowed with suh speialized mental modules is stillheatedly debated, and

we will briey return to it later. What is ertain however, is that regardless of

howspei the neuraliruitsare, already shortlyafterbirth hildrenare oriented

towards fae-likevisual stimuli[27℄. Byfae-like werefer to patterns haraterized

by symmetry and a ertain degree of omplexity. So, from the very beginning and

throughout life, individuals are exposed to a massive amount of faes. Combined

withpossiblehard wired neural iruits,this experieneresults infaes being the

lass of stimuli where we have the nest level of disriminability. As with many

tasks in pattern reognition, what seems almost eortless to humans has proved

very diult to implement robustly in a omputer, even though automati fae

detetion and reognition has been an ative area of researh for almost 30 years

[42℄.

(7)

in the algorithm, suh as olour spae, neural networks and geneti algorithms. I

assume the reader is familiarwith these subjets already, but a short introdution

willlater make it easier to disuss their purpose in the implemented system. Any

system that attempts to reognize a fae, or try to respond to the presene of one

insome way,must rstnd itinthe image. Inpratie thisinvolvessegmenting the

faefromthebakgroundbyndingasexatlyaspossibletheboundingoordinates.

While loalization and reognition may be intertwined in human fae reognition,

itpresents a natural divide in automated algorithms, and I have hosen to present

these steps of the algorithm separately in hapters 3 and 4. Chapter 5 gives a

summary of how the ideas outlined in the previous hapters are implemented into

a working system, as well as presenting some major design deisions. In hapter 6

results from benhmarking of the system are presented. In hapter 7 I reet on

some of the more general lessons learned from implementing the system the way

that we did.

1.1 Biometris

Biometris(anientGreek: bios ="life",metron ="measure")is the

study of automated methods for uniquely reognizing humans based

uponone ormore intrinsiphysial orbehavioraltraits [36℄.

The denition given above ontains two riteria that must be fullled by any

biometrisystem. It must be automated in the aquisition of data, the extration

of relevant features, and the lassiation. Seondly, the features where by this

lassiationisperformedarephysialor behavioral. Thesefeaturesan beaquired

fromoneormorephysialdomains,suhasngerprints,iris,voie,fae,bodyshape,

weight et. Howthe system subsequently proesses the data depends onwhat kind

of data it has olleted, as well as the purpose of the system. In terms of purpose,

it is ommon to dierentiate between identiation and identity veriation. An

identiation appliation suh as the FaeIt system introdued above, attempts

to determine the identity of a given individual. On the other hand, an identity

veriationsystemattemptstoverifythattheindividualreallyiswhohe/shelaims

tobe.

Before proeeding, let us refresh some fundamental onepts in lassiation

theory. In fae loalization, the task of the lassier is to determine whether a

feature vetor represents a fae. The lassier an ategorize the input orretly,

(8)

makesanerrorit aneither beby indiatingafaeispresent whenit infatisnot,

false positive, orlaimingthat it isnot present when itis, false negative.

Often there isa trade-obetween the degreeof ertainty of abiometrimethod

andhowinvasiveitisinaquiringtheneessaryinformation. DNAanalysis,whihis

beominginreasinglyautomated,andan underertainirumstanes beregarded

asabiometrisystem,willidentifyapersonwithvirtuallynoleveloferror. However

the proedure requires a physial sample of the individual. So, biometri systems

dier in many respets, how do we ompare alternative approahes? Biometri

systems an be ompared onthe following riteria[14℄:

•

Universality: towhat degree everyone posses the neessary features.

•

Uniqueness: whether the physial features uniquely distinguish between dif- ferent individuals.

•

^Permanenê: ^whether ^the ^physial ^features ûsed âre învariant ôver ^time, ôr

an be alteredby the individual.

•

Aeptability: towhat degree people willaeptbeing subjeted to analysis.

•

^Colle^tability: ^howêasy îtîs ^to ôllet^the ^neessary ^data.

•

^Performanê: ^how ^well ^the ^best âlgorithms ^perform ⁱⁿ ^terms ôf ^speed, âu-

rayand robustness.

•

Cirumvention: howeasy it isto foolthe system.

As an example, take ngerprints. They have fairly high universality, although

sarring an interfere with lassiation, and an estimated 5% of the population

does not have elligable prints [8℄. Fingerprints are to suh an extent unique that

even idential twins dier, as do the prints of the same individual aross ngers.

Furthermore,ngerprintvarylittleovertime,andurrentbiometrisystemsanuse

themtolassifyindividualswithverylowlevelsoferror. However, ngerprintshave

some aeptaneissues, as they are usually assoiated with riminalinvestigations.

Also, in olleting the data you need the individuals ooperation, or at the very

least, awareness and attention. On the other hand, fae reognition an be arried

out from a distane, even without the individual being aware. So universality,

aeptability and olletabilityof faereognition systems arehigh. However, faes

(9)

Furthermore, they vary over time as people wear glasses, grow beards or put on

makeup. The performane of state of the art algorithmsistherefore stillfairly low.

Simplied, any biometri system onsists of two major omponents. First fea-

turesmustbeextrated,thenthey must beomparedwithtemplates inadatabase.

One data of a ertain physial domain has been sampled, and before the lassi-

ation an be arried out, there is usually some preproessing required. When

a ngerprint sanner gathers data, some translation or rotation may be neessary,

but otherwise the data is fairly well segmented. This is not the ase for most fae

reognition appliations. Their input is usually a two dimensional image, and the

rst task of the system will invariably be to nd as aurately as possible the o-

ordinates of any faes present. Findingfaes in images may seem trivial, after all,

we all know what they look like. However, I will throughout this thesis attempt

todemonstrate that loalizationis in fat a hallenging task, at least as the visual

sene the individual is in beomes more omplex and the lighting onditions less

ideal.

Some of the fators ompliating the task of fae loalizationwill be disussed

and illustrated insetion 3.2. Fornow it issue tosay that the shapeof the fae

an ause two-dimensional representations to vary onsiderably when light is ast

fromdierent angles. Furthermore, olusion by objets suh as igarettes, glasses

andin turntheir shadows makesdetetion stillmore diult. Also,faesan be at

any distane or orientation relativeto the amera.

If loalizationissuh a hallenge,is itworth the eort? Why would we need to

loalize people in images? The most obviousand perhaps most proled use of this

tehnology is surveillane and seurity. Given the esalating onerns with airport

seurity and border ontrols, the tehnology has at least in these setors enormous

eonomipotential. Butthereare otherand perhapsmorebenignusesfor thesame

tehnology that reeive less attention. With the steep growth in sales of digital

ameras, the integration of inreasingly high quality digital ameras in embedded

eletronis suh as mobile phones and PDA's, digitization of video arhives, pro-

liferation of broadband aess, vasts amounts of video and images are beoming

available. Yet if this rise in availability is to result in a genuine inrease in aes-

sibility, then eah image must be labeled and tagged with ontent informationfor

retrievaland searhing. Unless suhtextual informationregarding ontentisstored

alongwith this data,itwillinreasinglybe rendered inaessible,lostinajungle of

information. The sheer amount of images to searh manuallybeomes prohibitive.

(10)

In CBI, a database is indexed not as onventionally done through a tag, verbal

desriptor or elementary data type, but by the atual ontents of the multimedia

item. In other words, rather than having a tag desribing the ontents along with

the data, ex. ar, the information is referened through features extrated from

thedataitself. Thereasonwhyfaeloalizationandreognitionalgorithmsarepar-

tiularlyrelevant to CBI, is that automati faereognition is one of few attempts

to lassify objets that an vary substantially over time and situations, but where

the dierenes between the individuallass members an be very small. I.e. a per-

son an hange faial expression, grow a beard or wear glasses, but the dierene

between individual people is relatively subtle. Fae reognition is therefore par-

tiularly diult, ompared to automated reognition of rigid objets suh as ars

or airplanes. Also, beause the relevant features needed to dierentiate individual

faes an hange under dierent onditions, most of the proposed algorithms are

abletolearnfromaset of trainingexamples. They arethereforesuiently generi

tobe appliedto loalizationorreognition of most other lasses of objets, given a

dierent trainingset.

Human mahine interation is the nal use for fae loalization/reognitional-

gorithmsIwillmention,anditisonewhihislikelytoinrease inimportaneinthe

near future. Faes are not just any lass of stimuli. We expet feedbak to be di-

retedtoourfaes, and our expressions alsoonveya lotof informationbeyond our

identity. Therefore, ashuman-omputer interation beomes more widespread, the

need for robust ways to detet, reognize and interpret the nonverbal information

present infaes willbeome more important.

(11)

rithm

2.1 Fae pereption, brain and biology; what an be learned,

and what should be used?

Engineers are ofteninspiredby the elegane andfuntionalityof biologialsolution

todiultproblems. However, astheyareworkingwithverydierentrawmaterials,

onvastlydierenttime salesandlevelsofomplexity,itisnotgiventhatbiologial

solutions an or should be applied to any given engineering problem. However, as

willbeevidentthroughoutthisthesis,several aspetsofthepresented algorithmare

inspiredby biologial systems and proesses. Also, as fae reognition is a skill all

healthy hildren develop seemingly withouteort, itmight beuseful tosketh how

this is aomplished by the human brain. This setion therefore gives a very brief

introdution to the funtional neurobiology of pereption, before desribing some

relevant and importantndingsin neuropsyhology.

Wean denereeptive eld of aneuronastheregionofthe visualeldinwhih

stimuliwill aet its level of ativity. As one moves up the neural hierarhy, from

the sensory ells to assoiation areas of the neoortex 1

, the reeptive elds of the

ells beome larger, and the stimuli that elit their response beomes inreasingly

omplexand abstrat [5℄.

The neural proesses of reognition begin in the retina, where light foused

throughthelenshitsphoto-pigmentmoleulesinthespeializedellsliningthewall,

ausingthese moleules tobreak down. The onstituent parts of the breakdown in

turnause ionhannels 2

tohangeshape,allowingan inuxof Na

+

,ultimatelyre-

sultinginaneletrialsignal. However, the retinadoesmorethansimplytransform

lightto aneletrohemialanalogue. A large amountof low level signal proessing

is performed in the retina. Information is arried from the eyes through the opti

nerve, whih is formed of speial ells alled ganglion ells. Due to proessing in

the retina,ativity of ganglionells does not merely represent the presene of light

at a partiular point in our eld of vision. Rather, these ells respond maximally

todierenes inintensitybetween the enter and the surroundingin theirreeptive

eld. This way retinal ganglion ells onvey information about disontinuities in

1

phylogenetiallynewerregionsofthebrainthat areneithermotororsensorybutarethought

tobeinvolvedinhigherproessingofinformation.

2

omplexmoleulesin theellmembrane

(12)

Through the opti nerve, informationis relayed through parts of the mid-brain

to the primary visual ortex. Still, in the region that the opti nerve feeds into

the ortex, the ring of the neurons represent very basi and fairly understandable

aspets of the visual stimuli. However, as one moves away into the surrounding

assoiative ortex, more abstrat and integrative qualities of the visual stimulus is

proessed(as istrue for all otherprimary sensory ortial areas).

However, one of the broad distintionsusually made invisualproessing isthat

of the ventral and dorsal stream 3

. The ventral stream passes information to the

temporallobe,andismostly onernedwith shape(objet)reognition. Thedorsal

stream follows a path up toward the parietal lobe, and is mostly responsible for

loalizingobjetsinspae. Aswefollowthedorsalstreamfurthertowardsthemotor

areasofthebrain,theneuraltissuegraduallyproessesourreation toobjets, suh

asreahing forthem. Thesetwo aspets,loalizationand reognition,antherefore

be dissoiated if one of the areas is damaged. A person may reognize an objet,

but isunable toquikly and aurately reah for it,orvia versa.

The introdution given to neural proessing of visual information is of ourse

onlyanabstration. Theoptinerveismadeup ofroughlytentimesasmanybers

astheohlearnerve 4

,andalargepartoftheortexandmanysubortialareasare

insomewayinvolvedintheproessingofvisualstimuli. Thestrutureof theortex

is far more homogeneous than the phylogenetially older neural areas. Therefore,

purely strutural and physiologial knowledge gives limited understanding of how

and what informationisproessed.

Mostknowledgeofnormalbrainfuntioning inhumansomesnotfromknowing

itsstruture, but fromstudying damaged brains. Broadlyspeaking the three main

souresofsuhdamageisshrapnel fromwarwounds, strokesandtumors. However,

omaredtowhatweknown aboutdamagetomotorareasintheparietallobe,muh

less is known about damageto the highervisual areas. The blood vessels that feed

theparietalmotorareas aremuhmoreonvoluted andvulnerableforrupture than

those of the oipital lobe. Shrapnel damage to the visual areas (oipital lobe)is

alsolikelytodamagebrainstemareasontrollingbreathingandbloodpressure,and

resultin fatality.

Damage to the eyes will of ourse result in blindness, and the same is true of

3

Ventralanddorsalaremedialtermsindiatingtowardsthefront andtowardsthetopand

bak,respetively

4

auditory nerve

(13)

higherintheproessinghierarhy anresultinlossofhigherlevelfuntions,suhas

olour blindness or inability to pereive movement. Agnosia is a term referring to

thelossof abilityto interpretthe stimuliinagiven sensory modality. Forexample,

apatient withvisual agnosia an experiene allthe onstituent elements ofperep-

tion,suh asolour, texture,edges, movement et., but isunable tointegratethem

all into a meaningful whole. One rare but highly studied version of this disorder

is that of prosopagnosia, thought to be a seletive impairment in reognizing faes

resultingfromdamagetoertainareas oftheventralstream. Apatientisoftenable

to judge orretly the expression of faes and to disriminate between instanes of

other lasses of objetssuh vegetables and fruit. Prosopagnosia is among the evi-

deneoftenitedasindiatingdediatedneuraliruitsevolvedforfaereognition.

However, just how seletive the impairmentis is a matter of some ontroversy [7℄,

as most patients have redued performane on many tasks requiring subtle visual

within-ategory disrimination. Fae reognition is perhaps the most omplex vi-

sualtaskpeopleroutinelyperform, andsuddenlyfailingtoreognizetheirspouse or

hildrenisertainlythemostnotieableanddebilitatingresultofthebraindamage.

Fora further disussion see [2℄ or[13℄.

While muh is known about the low-level visual system, neurophysiology has

not signiantly guided fae reognition algorithms beyond providing a biologial

justiation for using gabor lters and feed-forward neural networks 5

[26℄ . But

althoughtraditionalneuralnetworksareinspiredbygenuineneurons,they represent

onlyaverysimpliedmodel. However, more biologiallygrounded approahes that

mimi the biology of low-level vision have been published [26℄. So, what of the

higher order properties? One important question that ould impat the design of

ouralgorithmistowhatdegreefaepereption reliesonindividualfaialfeaturesor

anholistievaluation. Dofeaturesaseyes,noseandmouthindependently ontribute

toreognition,orisfaereognitionperformedonthefaialareaasawhole,without

breakingitdownintofeatures. Thisquestionisimportant,asitmirrorstwodierent

strategies in biometrisystems. We willlater desribe the eigenfae approah that

we have implemented, and this approah does not extrat features from faes, but

simplyompares two image areas.

Lastlyitshouldbementionedthatthebrainusesallinformationavailable,rang-

ingfromothersensorymodalitiestotheontextingeneral. Faesareloalizedfaster

if they are in a familiar ontext, and reognition is better if we an hear the per-

5

Duringtherstfewmilliseonds,visualproessingisthoughttobefeed-forward.

(14)

multimodalbiometrisystems 6

has superior performane ompared toonventional

systems, and isgaining popularity, [16℄.

2.2 Colour and olour spaes

A olour spae, is a mathematial way of representing olours as data, typially

as three or four values or olor omponents [34℄. Many dierent olour spaes are

dened, and the one to use for agiven task depends on the purpose of the system.

This is onveniently illustrated with the RGB olour model, arguably the most

familiar one to most omputer sientists. In traditional CRT sreens olour was

generated by energizing tiny dots of red green and blue oloured phosphorous with

aneletronbeam. Thesethreeoloursinturnstimulatethreeseparate typesofells

inthehumanretina,anddierentoloursanbegeneratedbymixingtheintensities

of these basi olours. For the purpose of displaying an image to sreen, an RGB

olour model may be the optimal one to enode the data, but for other tasks an

alternativeolourspae may bebettersuited. Forexample,inRGB,lightintensity

is not independent of hue. Rather, intensity is the mean value of all three olour

dimensions. Ifwe regard say a red objet ina brightly lit room, and then dim the

lights, the objet arguably has the same olour it only reet less light. Several

olourspaes are dened sothat intensity is independent of hrominane, as inthe

HSV/HSI olour spae. In HSI I stands for intensity, S represents saturation (or

how vivid the olour is), while H hue enodes the dominant wavelength. Colour

is important in fae loalization software, beause if it is available it an help us

onentrateour searhonskinoloured areas. Furthermore,HSI istheolourspae

most often used in fae reognition systems, as studies have shown that dierene

inthe pereived skin olour between individualsof dierentethniitylies mostly in

dierenein intensity rather than hrominane [19℄. It would of ourse be possible

tothreshold the skin olourregions inRBG spae, but the region onstituting skin

olourmayrepresentaomplexvolumeorunionofsuhin3Dspae. Ifwetransform

the olourspae in suh a way that intensity represents anindependentdimension,

wean threshold the 2D hue/saturation planeinstead. Also,even thoughthere are

eient algorithms for thresholding skin olour regions diretly from RBG olour

spae,empirialevaluationshave found alternativeones (HSIand YCbCr)superior

[28℄.

6

systemsthatusemorethanonebiometrifeature

(15)

to transform them to HSI olour spae. The following simple formulas transforms

RGBintotoHSIspae. Here{R,G,B}representthethreebandsintheinputimage,

and {I,S,V}in bandsin the transformed image.

I = R + B + G 3

S = max(R, G, B) − min(R, G, B) max(R, G, B)

H =

"

acos .5 × ((R − G) + (R − B))

((R − G) ∗ (R − G)) + ((R − B ) ∗ (G − B))

!#

× 360 2π

Many faeloalizationapproahes has skinolourthresholdingasaninitialstep

intheproessing,eitherdiretlytondfaes,ormoreindiretlytoreduethesearh

spae. This is useful, as the property is fast to evaluate, invariant of rotation and

saling,and is fairly insensitive tohanges in lightintensity.

2.3 Neural networks

Abiologialneural networkis aninteronneted group of neurons. The term arti-

ialneuralnetwork(ANN) isusedformathematialmodels thatmimisome ofthe

fundamentalharateristis of the way informationis represented and proessed in

biologialneuralnetworks.

Before going on to desribe in some detail the elements and harateristis of

ANNs, I'll briey review the history of their development. Although ognitive neu-

rosiene 7

isaveryativeeldofresearh,mostaspetsoftheenodingandproess-

ingof informationinbiologialneural systems isyetto be fullyunderstood. In the

1940's,the synapse, and the hemial transmissionin the synapse wasa reent dis-

overy,andgaverisetotherst modelsofhowthe physiologialpropertiesofneural

ells allowed them to proess information. Donald Hebb wasamong the pioneering

researhers in this eld, suggesting that learning ould result from strengthening

the synapses between synhronously ring neurons. Coneptually important was

alsothe work of Friedrih Hayek, and his hypothesis that order in the brain ould

arise out of deentralized networks of simple units. These ideas in turn gave rise

to the rst working models of neural proessing. In 1943, MCulloh and Pitts

published theirartiial neuron,illustrated in gure1. Fourteen years later, Frank

7

An interdisiplinarybranhofneurosieneinvolvedintheneuralmehanismsof ognition

(16)

x 1

x 2

x n

y = f(a) w

w w

1

2 n

Σ

a = x w i i

Figure 1: MCulloh-Pitts neurons

Input Hidden Output

Figure 2: Simple multilayer perep-

tron

RosenblattdevelopedaworkingneuralnetworkmodelonsistingofMCulloh-Pitts

neurons,whihisknownasapereptron. However, afterinitialoptimism,ANN'sfell

intosomethingofanaademi disreputewhenMarvinMinsky andSeymourPapert

proved thatpereptrons are linearlassiers unableto solvelassiation problems

that are not linearly separable, suh as the simple XOR problem. Furthermore,

Minsky and Papert argued that even networks with more layers would not be able

tosolvethe XORproblem[15℄. Notuntilthe 1980's did ANN omebak intofavor

inaademia, when their onjeture was proven false.

2.3.1 Neurons - The basi building bloks

Theneurons ornodes are the basielementsof neuralnetworks. The meannumber

of onnetions feeding into a genuine biologial neuron is around 10000, although

thenumberishighlydependentonthetypeofneuronand whereitisloatedinthe

nervoussystem[5℄. Liketheirbiologialounterparts, artiialneuronsgatherinput

from neighboring nodes and pass it on, and an as suh be regarded as funtions

frommany dimensions toone.

In the nervous system, the eet one ring neuron has on another depends on

several fators. Among themare how farfrom the ellbody the synapse isloated,

the amount of neurotransmitters 8

released when the neuron res, and the number

of reeptor moleules in the reeiving ell. The two lastfators an be modulated

through learning. In the same way, the eet of one artiial neuron on another

depends on the weight of the onnetion between them, as in biologial networks.

The ativation passed on from the rst neuron is simply its ativity multiplied by

8

Chemialtransmittinginformationarossthesynaptigapseparatingtwoneurons.

(17)

sumof allinomingativation. In pereptrons,the ativationof aneuronyisgiven

by

a ( y ) = K h w, x i + b

(1)

wherex isa vetor of ativationof the neurons feeding intoy, wis avetor oftheir

assoiated weights, b is a onstant (bias), and K is a (usually nonlinear) funtion

suh as the logisti or tanh. The neurons are individually simple, and basially

onlyrelayinformationomingfromonnetedneurons. Itisthenonlinearresponse-

funtion of the individual neurons that makes the behavior of the network as a

whole omplex and diultto analyze. The non-linearity of the response funtion

ensures that the ativation of the neuron will never exeed a maximum value, in

turn ensuring the stability of the system as a whole. If the information that is fed

intothe system isorrupt insome way, or if anindividual neurons stop working as

normal,the impat is minimized. This fault tolerane is one of the most attrative

aspets of both biologial and artiial neural systems. The response funtion we

seletedforourimplementationwasperhapsthemostlassi,thelogistiorsigmoid

funtion:

f(x) = 1

1 + e ⁻ ^x

⁽²⁾

2.3.2 Network topology

Thenetworktopologyrefers tothelayoutofthenetwork,essentiallywhihnodesare

onneted, and in whih diretion information ows along these onnetions. The

topology of the network is one of the most important parameters determining to

whatdegreeit isable tosolve agiven problem. Itis ommonto dividethe network

intolevelsor layers,where nodesinone levelare mostlyorexlusively onnetedto

nodesinthe neighboring levels. Thelevelthatreeivesthevetortobeproessedis

referredto as the input layer, and informationpropagates the network terminating

atthe output layer. Between thesetwo,thereareusually oneormorehidden layers,

referredtoas suh beausetheir ativity isnot generallyvisible. It isommonthat

informationows onlyinone diretionthrough the network, whihis referredto as

the network being feed forward.

Itislearthatanalmostinnitenumberoftopologiesisavailabletoaresearher

forany given problem. How many layers should be used? What exat interonne-

(18)

Shouldwe use reurrent onnetions, onnetionswhere informationan owaway

fromthe output layer? Fortunately, neural networksare fairly robust inthat many

topologies an learn the required mappings in a given task. However, the dierent

topologiesmay varyin terms of speed oftraining, and generalizationfrom training

totest data. Though the possibilities are endless, very few guidelines exist to help

selet topologies.

So let us look briey on whih topologial parameters an be important to the

suess of ANN. General number of onnetions; If topology is too small (in

terms of units and onnetions), the network may not be able to learn the desired

input/outputmapping. On the otherhand,if itistoolarge, thenetwork very often

generalizes poorly to new input [4℄. All the weights need to be set by the training

data,and moreweightstranslates intoalargerrequired trainingset. Eet of the

number of hidden neurons; Usually the network will have fewer nodes in the

hidden layer than in the input layer. This fores the network to extrat features

from the input vetor, and the nal deision (output layer) is based on a small

numberof variables[40℄. As with the general level of onnetivity, a largernumber

ofhiddennodeswillallowthenetworktoadaptbettertothetrainingdata. However,

aswith ageneral inrease inonnetions, this adaptation may be over spei to

the trainingdata, again resultingin apoorer generalization.

Number of layers; As desribed above, a (two layer) pereptron is simply a

linear lassier, meaning it an only solve a lassiation problem when a linear

funtion an separate the lasses in feature spae. Formany problems of any om-

plexity,thisisnotenoughandwemustresorttomorelayers. Alsointermsofhidden

layers oneshould limitoneself to asfewas possible,asgeneralizabilityand training

speed willbeaeted. However, oasionallymore than twolayers may be thebest

solution. This is typially the ase when very similar input vetors must result in

very dierent outputresponses. Withfourlayers, the rst hiddenlayeran beused

toreode the data beforepassing it on tothe seond hidden layer, whih performs

the required mapping onto the output nodes. Full vs loal onnetivity; In a

fully onneted network,eah node in alayer isonneted to every one in the layer

aboveit, and this is oftenthe default hoie. One popular optionis tostart with a

fully onneted network, and prune away onnetions by testing to see if network

performane deteriorates if the weights are removed. Empirial evaluations have

determinedthat loalonnetivity performs better at fae detetion tasks [9℄. Re-

urrent onnetions; Reurrent onnetions are of ourse the norm in biologial

(19)

lassesoftaskstrulyrequiretheseonnetions,andalsobeausealgorithmstraining

these networks are far less established. The weights of the onnetions distributed

through the network an be viewed as its long term memory, adjusted to learn a

given task from a training set. However, typially the network has no short term

memory,one the informationhas propagatedthe network it annotrememberthe

patternitwaslastexposed toorthe orderin whihtwopatterns were presented. If

thenetworkistohavesuhashorttermmemory,itmusthavereurrentonnetions.

2.3.3 Network training

Bynetwork trainingwe usually mean analgorithmfor adjustingthe weights of the

onnetions in order to minimize the output error relative to some riterion. In

supervised training, the network is presented with a training vetor for the input

layer, as well as a vetor ontaining the orret responses for eah training vetor.

The total error given a set of weights is usually formalized as the sum of squared

deviationsof the output nodes from the orretresponse.

E = 1 2

X

o

(t o − y _o ) ²

⁽³⁾

The development of the bak propagation algorithmfor trainingmultilayer per-

eptrons was a major reason for the revival of interest in ANNs. Although rst

proposed by Paul Werbos in the 1970', it was after the lassial artile in 1986,

Rummelhartet. al. in Nature, thatthe bak propagation algorithmbeamewidely

popular [24℄. Eetively, bakpropagation is a way to train multilayer pereptrons

usinggradient desent. Thealgorithmsderivesitsnamefromthefatthat theerror

propagatesthe networkfromthe outputlayertothe input layer, beforethe weights

are hanged. Gradient desent is an optimization algorithm that approahes a lo-

al minimum by adjusting the weights so that we move in the steepest downward

diretionalongthe error insmallsteps.

∆w ij = −ǫ[ δE

δw _ij ]

⁽⁴⁾

Thesizeof thesteps totakeisreferredtoasthe `learningrate

δ

^. ^Ideally^we ^want^to

endup with network weightsthat leavesus atthelowest pointonthe errorsurfae.

Ahighlearningrate willusually meanthe trainingonverges faster,but if theerror

surfae is very eentri, taking large steps ould ignore important loal gradient

(20)

follows the gradient,itis very likelytoget stuk inapoorminimum, andto redue

the risk of this, a momentum is usually added to the movement along the error

surfae. As the name implies, this parameter is intended to give the algorithm

enoughmomentum to get out of smalldips in the error surfae.

In order to have a starting position on the error surfae, the network weights

are initially seeded with small random values. From there, training proeeds iter-

atively, either until the algorithmonverges{we are at a minimum, and there is no

further improvement, the error drops below a required value, or a maximum num-

ber of training iterations have been run. A host of other trainig algorithms have

been developed, suh as the rprop andquikprop, but bakpropagation remains the

workhorse of neuralnetwork training.

The error surfae is usually very eentri, and in genuine networks of any size,

we never end up at the global minimum. However, this is not neessarily suh a

bad thing. The network may still solve the problem admiraly and generalize well.

Furthermore, beause they usually never end up with the same nal weights even

whentrainedwiththesamedata,dierentnetworkswiththesametopologyrespond

dierentlytonewinput. This anbeexploitedbyombiningtheresponseof several

networks, usually referredtoas networkvoting. In otherwords, even if one ofthem

makesan error, the others an ompensate.

2.4 Geneti algorithms

A algorithm an be dened as expliit step-by-step proedure forproduing asolu-

tion to a given problem. Aording to this denition, and the way it is ommonly

usedinomputersiene, algorithmsshouldguarantee thatthe desiredendstatebe

ahievediftheproedureisfollowed. Dierentalgorithmsanvaryintheireieny

insolvingalass of problems,and an beompared byhow theomputer resoures

requiredby the algorithm,suh asthe numberof instrutionsortemporarystorage

spae,sales as the size of the inputinreases. This isformalized inO-notation

f(x) =

O u(x)

as x

→

^a,^provided ^that

|f (x)| ≤ K |u(x)|

However, ertain lassesof algorithmssuh asprobabilisti algorithms,may not

stritly fall under the denition of algorithmsgiven above. In a probabilisti algo-

rithm, the probability that the result willdeviate from the desired end state when

thenumberofiterationsinreases an bequantied, andwilloftenonvergeto0 as

the number of iterationsinreases to innity.

(21)

ahieved,nor thatthe distanefromsuh astate anbequantied. Instead, geneti

algorithms represent a general way of moving towards better solutions in ertain

omplex optimization senarios. As the name implies, it borrows stylized elements

andoperatorsfromevolutionarybiology,andinessene,asimulatedmiroevolution

isrun on the problemdomain.

For most real world problems of any size, a omplete evaluation of the searh

spaewithanysigniantdegreeofresolution isprohibitive,and oftentherearefew

reasoned hoies as to how to move towards an optimal solution. Stritly optimal

solutions,astheglobalminimumonsomeerrorfuntion,maynotevenbeneessary.

What may be desirableis anon-arbitrarily improved solution.

Genetialgorithmsmay bepartiularlywell suited for omplexproblems where

heuristiormoreformallydened methodsof optimizationare not available, where

the tness funtion is so omplex that traditional methods frequently end up at

loal minima, the searh spae is large, omplex or poorly understood, or domain

knowledgeis sare orexpert knowledge isdiult toenode [4℄.

Beforeagenetialgorithmanbeimplementedforagivenproblem,twoelements

must be in plae. These are i) a way of representing a solution in suh a way that

itan bemanipulated by the algorithm,and ii)a way of determiningthe tness of

agiven solution.

2.4.1 Elements and operators in geneti algorithms

Genes; representing solutions. In GA a solution to a problem an be thought

ofas anindividual ina populationof solutions. The solutionstothe problemmust

beoded ina way that an beated uponby the evolutionary operators. As thisis

roughly analogous to the way DNA ontains the biologialblueprint of anindivid-

ual organism, the enoding of a solution is often referred to as a hromosome. In

DNA,tripletsof 4dierentkinds ofbases odefor 21distintamino-aids. Though

omputationalanalogues of suh omplex enoding shemes are possible and have

been implemented, usually asimpler enoding sheme is used, suh as a bit string.

The important pointis that this hromosomeodes for one unique solution.

Fitness funtion; evaluating solutions. As evolution is based on the sur-

vival of the ttest, we need some way of determining how t our solutions are,

andhopefully movetoward betterones during evolution. More speially weneed

a tness funtion. This funtion takes a hromosome, deodes it into a working

(22)

M M

Figure 3: Bit mutation

c1 c2

Figure 4: Chromosomal rossover

individuals, and evaluates it's tness. What onstitutes the tness of a solution

must be dened for eah problem or lass of problems, and how good the tness

funtion is will largely determine the suess of an evolutionary approah. Unless

the tness funtion is handled properly, GAs have a tendeny to onverge towards

aloalratherthan the global optimum of the problem.

Geneti operators; modifying solutions. Given than we now have ways of

enoding individual solutions, and an determine their tness, what we need are

evolutionary operators ating upon the individuals. Two main operators are used,

mutation and rossover,and their relative importaneis widely debated. Both are

illustratedin gure 2.4.1.

Mutation is perhaps the most widely known evolutionary operator. It basially

refers to small random alterations to the hromosomes. An important parameter

ofevolutionaryalgorithmsisthe rate of mutation,the balane between hange and

stability. If the rate is too low this may ause geneti drift, or onvergene to

loaloptima. Alternatively,ifthe mutationrate istoohigh,itmay leadtounstable

hromosomal behavior and jumping around in searh spae. However, exatly

whatonstitutesagoodmutationrateis hardlyever disussed,and forallpratial

purposes must be hosen based ontrialruns.

The seond evolutionary operator is that of rossover. The ells that fuse in

oneption eah have a single version of the 23 unique hromosomes in the human

genome. Thisfusionresultsinadiploidorganism,onewithtwosetsofeahhromo-

some. During meiosis 9

, when the sex-ells are generated, the pairs of hromosomes

separate, one migrating to eah of the two new ells. However, in the proess, the

9

A elldivision thatgenerateshaploid sex-ells,i.e. ellswithonlyonesetofhromosomes.

(23)

realignment of parts of the DNA threads. A similar proess is often implemented

ingenetialgorithms. When twoorganismsare seleted for breeding, theremay be

one or more points of rossover, where large segments of their DNA is exhanged.

Crossoverthusworksonamuhhigherlevelthanmutation,exhanginglargehunks

of DNA, and therefore alsolarger number of phenotypitraits.

Up to now, a fairly stylized but biologially reasonable set of operators have

been implemented. However, as the rates of mutation and rossover are hard to

alibrate, and the gene pool tiny ompared to that in say, a small drop of water,

youriskloosinggoodsolutionsfromthepool. Toensurethatgoodsolutionsarenot

lost,aproess referredtoaselitistseletionorelitism isused. Thismerely refers to

bypassing the regular evolutionary proesses, and simply opying a handful of the

best solutionsinonegenerationintothe next. Thiswayelitismanrapidly inrease

the performane of geneti algorithms,as it prevents the lossof goodsolutions.

2.4.2 Running a geneti algorithm

We are now in a position to put all the elements together into a working geneti

algorithm. First, a pool of (often random) hromosomes is initialized. A random

poolensures thatinitialsolutionsaredistributed overmuhof theparameterspae.

To ensure that large parts of the parameter spae are searhed, the pool should

ideally onsist of a large number of hromosomes, from hundreds to thousands.

The exat number is of ourse up to the individual researher to determine, and

will depend heavily on the time taken to evaluate tness. For eah generation,

the hromosomesundergo some degreeof mutation. Pairsofhromosomesare then

seletedforbreeding,theprobabilityofseletionbeingsomefuntionoftheirtness.

Mating results in a new pool of hromosomes, with hopefully better mean tness,

andthe proessan repeat itself. The algorithman terminateafteraxednumber

of generations or when a distane from a riteria is reahed, while no onvergene

tosuh aminimum distane is guaranteed.

2.4.3 Geneti algorithms for neural networks

With the number of synapses in the human brain estimated at 10E14 [37℄, and

the number of genes in our genome urrently thought to lie between 20-30 000, it

is unlikely that evolultion has shaped the topology of this network at the level of

individual synapses. However, on a slightly higher sale, the brain is the result of

(24)

neural iruits, it is not far fethed to try to shape them with the same kinds of

evolutionary proesses. In fat, applying evollutionary algorithms to various har-

ateristis of ANNs is not new, but has emerged as an established methodof both

seletingnetworktopologyandtrainingnetworks[4℄. Untilreently,seletingnetwork

topologywas done troughhuman expert knowledge, but this may hange. Shaer

et.al. have presented results showing that the ANN designed by evolutionary ap-

proah had better generalizationability than one trained by bakpropagation on a

humandesignedarhiteture[4℄. Furthermore,whilethebakpropagationalgorithm

isby farthe mostwidely used methodof trainingnetwork,itmay not alwaysbean

option. Trainingthroughgradientdesentapproahes requirestheerrorbedieren-

tiable,while evolutionary algorithmsan train networksregardless ofwhether they

are feedforward or reurrent, and even if error is disontinuous [4℄. Evolutioinary

trainingis also less likely than graidient desent based methodsof getting stuk in

aloalminimum,beauseitsearhes overseveral regionsinparallel. Unfortunately,

forthe same reasonit has diulties in ne tuning the parameters.

(25)

3.1 An overview of the algorithms

Though faedetetion has reeived a lotof attention the past 10years, ithas been

a eld of researh for many more. With the development of better algorithmsand

more powerful omputer hardware, the eld has moved from searhing for a single

vertial frontal fae in visually simple images with homogeneous bakground, to

multiple faes in diult lighting onditions, at various angles and distanes from

theamera,situatedinomplex visualsenes[19℄. Still,even today the majorityof

algorithmspublished are onlyable to detetvertial ornearly vertial faes.

Asoutlinedinthe introdutionof thisthesis, the ability toquikly androbustly

loalize faesin images is valuable innumerous areas, ranging tosurveillane, on-

tent based indexing and human mahine interation. For this reason there are a

multitudeof published algorithmsinthis area, eahone trying tolaim superiority

onrobustness or speed.

In the myriad of approahes, it is useful to distinguish two major ategories

[39℄[19℄. Oneassumesanapriorifaemodel(model-based/topdown),andattempts

to loalize/validate by means of an expliit interrelation of faial features (suh as

eyes, mouthand nose), while the other approah assumesno priormodel.

3.1.1 Feature-based/knowledge-based algorithms

Basially,infeature-basedmethodsforfaeloalization,weattempttousewhatwe

alreadyknow aboutfaes toonstruta modelwhihan be omparedtothe data.

When presented with a new piture,a feature based algorithmusually begins with

alow level analysis,where propertiessuh asedges, olour, motionand texture are

extrated [19℄. The onstellationof these features are then ompared tothe model.

Deformable templates,templates that tan apriori elasti modelto faialfeatures

are an example of a more sophistiated feature based approah. A problem with

several of these methods, is that the deformable templates must be initialized in

the viinity of the fae. Good fae andidates an be found by looking at ellipti

skin-olourregions.

3.1.2 Image based algorithms

The dening haraterizing of image based approahes is that we do not diretly

enode our preexisting knowledge about faialfeatures intothe algorithm. Rather,

(26)

Figure 6: Thresh-

olded

Figure7: Clustured Figure8: Cleaned

Figure 9: Highpass-

ltered

thesystem learns todisriminatefaes fromnon-faesthrough analyzinga training

set. The images are is mathematially ompared with the stored representation

withoutexpliitlyextratingfaialfeatures. Popularmethodshereinludeprinipal

omponentbasedalgorithms, neuralnetworks,supportvetor mahinesand hidden

markov models [19℄. Prinipalomponent basedmethodsare presented indepth in

setion4.1.2, while aneural network based approahis desibed in3.4.

3.2 Illustrating hallenges with a model-based algorithm

Implementing theneural networkbased algorithmpresented inthis thesis was ala-

borious proess, undertaken only afteran initialmodel-based approah proved not

to be suiently robust. Model-based approahes are often fast to implement and

exeute,are easytounderstand,andhaveanintuitiveappeal. Theydohoweverrely

onthe assumed faes onforming fairly stritly to the model, and small deviations

from ideal onditions will often result in failure. In this setion I will sketh the

model-based fae loalizationalgorithmwe abandoned, and give examples of some

of the many ways it an fail. We hope this will illustrate some of the diulties

involved in automati fae loalization. The model-based attempt is loosely based

onthe algorithm of Hesieh et. al.[12℄. The steps of this algorithm are, skin olour

thresholding,skinolourlustering,featureextration,templatemathing,andthese

(27)

1. Skin-olour thresholding.

Thesimplestonditionforfaeloalizationalgorithmsisasinglefrontalfaeagainst

adistintlynon skin-oloured bakground, suhasa typialpassport photo. Under

theseirumstanes,skinolourthresholdinganbealmostsuienttoloalizethe

fae. If an area is skin oloured, has roughly oval shape and ontains some darker

internal regions, we may lassify this as a fae. Figure 6 shows the result of skin

olourthresholding of gure 5.

Figure10: Clustered 2d histogram 2. Skin olour lustering.

In natural settings, skin olour thresh-

olding willhardly ever be suient for

loalizationbeausevariationsinbak-

groundhues willmake faesdiultto

segment based on olour alone. Often

two or more faes will be merged in a

single skin oloured region, as an be

seen ingure 6. To determine whether

an image area ontains several faes,

some sort of lustering is typially performed. In our initial approah, we devel-

oped a lustering algorithm for skin olour regions, the details of whih are given

in appendix A. Essentially the lustering approah involved identifying the peaks

of the 2D hue/saturation histogram 10

. After merging lose peaks, the histogramis

divided into regions aording to the eulidean distane to the enter of mass the

losesthistogrampeak. Clustering of the thresholded2D hue/saturationhistogram

an be seen in gure 10. Figure 7 illustratesthe eet of the lustering algorithm,

and gure 8 after some leaning. We an see that in this ase the lustering algo-

rithmhas suessfully separated the dierent faes, and atmost one fae ispresent

ineahskinolourregion. Formoreexamplesofskinolourlustering,seeappendix

A.Whilewe were pleasedwiththe performane ofthe lustering algorithm 11

, oa-

sionallyitwasunabletoseparate afaefromthe bakgroundorfromanother. This

wastypiallythease ifthe bakgroundolourwassimilartothatofthe fae. As it

onlysearhed for a single faein eah skin oloured region, one oroften both were

missedif the regionontained more than one fae.

10

peakswere denedas>0.9standarddeviationsabovethemean.

11

Seewebsiteformoreexamples

(28)

3. Feature extration.

Hopefully atthis stage, the image isbroken down into aset of regionsrepresenting

dierent objets or people, and the next stage is to determine whether any one of

theserepresentsorontainsafae. This isdoneby rstextratinglowlevelfeatures

fromthe regions. Eyes, nose andmouthareoftendarkerthanthesurroundingarea,

soaommonstep is tosearh for pathes ofpixels withlowerintensitiesthan their

surroundings[12℄. We found this approah tobe too unreliable, and easilyaeted

by shadows and normal variation in light intensity. Rather than look for darker

areas, we found high-pass ltering to be more robust. However, it beomes a hal-

lengeto detetfaes atvarioussales, beausehighpass lteringof larger faeswill

leave very dierent signatures than small ones. Dierent high-pass lters may be

best suitedto enhane thefaial featuresinlarge vs. smallfaes. Dierentfeatures

ould be used for skin-olourregions of dierent sizes, but the size of this regionis

not a perfet indiator of the size of the fae within. Although better than mere

intensity,thehighpasslteringwasalsosensitivetoedgesausedbyglasses,beards,

sharp shadows et.

4. Template mathing. One the features are extrated, one must determine

whether the onstellation of features found represented afae. Hsieh et. al. used a

templateoflightintensity roughlyrepresenting what one would expet given aver-

tialfae. i.e., two darkerobjetsinthe leftand rightupperquadrants,along dark

path in the lower middle et. Our approah was to t a triangle to the enter of

massof themost prominenthighpass-lteredobjetsintheinterior 12

of the region,

12

Searhrestritedtotheinteriorregiontoavoidhighpasslteringthebordertothebakground.

(29)

positive hits from our implemented system.

as it was assumed that the most prominent objets would represent the two eyes

and the mouth. If suh a triangle t well 13

, then it was delared a hit. Searhing

fortriangles representing faialfeatures is awidely used strategy [17℄. Usually itis

assumedthat thefaesarevertialornearlyvertial,andifthis isnotthe asethen

they will be missed. Approahes that require allfeatures to be found, will alsonot

detetfaes whereone of these are oluded.

Figure11illustratesthe suess of themodel-based loalizationalgorithmonan

easy image. We an see how the eyesand mouth are the most prominent objets

after highpass ltering, and the triangle ts perfetly. On the other hand, gure

12 shows examples of images in whih the model-based approah would result in

detetion failure. Fators that make these images diult is the lak of olour,

diultlighting onditions, rotation and olusion,and the vast dierenein sale

among the faes. One might argue that a fae reognition system does not need

to be robust under diult onditions. This is beause the urrent reognition

algorithmsare all so sensitive to the the same variations in lighting and olusion

thatarelikelytothrowoamodel-basedloalizationalgorithm. However,itwasour

experienethatoftenthealgorithmwouldfailevenunderseeminglygoodonditions.

Furthermore,even whenafaewasorretlydeteted,thisdidnot neessarilymean

thatthealgorithmwasabletoboundthefaesuientlywellforreognition. Often

itonfusedamouthwith the shadow underthe nose,or theeyeswith theeyebrows

or rims of the glasses, resulting in a bounding box overing either too littleof the

fae, or too muh of the bakground. For an example of loalization failure, see

appendix A.

13

thelengthofallsideswereroughlyequal

(30)

100 200 300 400 500 600 700 10

20

30

40

50

60

−30 0 −20 −10 0 10 20 30

500 1000 1500 2000 2500

100 200 300 400 500 600 700

10

20

30

40

50

60

−30 0 −20 −10 0 10 20 30

500 1000 1500 2000 2500 3000

Figure13: Horizontallineonguretotherightindiatereognitionthreshold. Trans-

lationand saling of even smallamounts leads to reognition failure.

3.3 The importane of aurate loalization for reognition

Fae reognitionalgorithmsare oftentested onvariationsinlighting, pose or faial

expression relativeto the storedtemplate, inan eorttodetermine howrobust the

algorithmis. However, rarely doauthorsexplore howsensitive thesealgorithmsare

tosub-optimal segmentation of the fae area. In other words, howaurately must

the oordinatesof aboundingbox aroundafaebeinorder tobeable toreognize

the individual, and how does performane degrade as a funtion of dierent kinds

of sub-optimal loalization. As we had deided to use the eigenfae approah 14

for

the reognition we rst wanted to get animpression of how sensitive it was to the

aurayof the loalization.

A prototypeof aeigenfae based reognition system was implemented inMAT-

LAB.Wethenseleted 31imagesfromtheStirlingfaedatabase [29℄,andmanually

found the oordinates of the faes. Reognition rates were evaluated as a funtion

of systemati variations in these oordinates 15

. The kinds of variations performed

inluded horizontal, vertial and diagonal translation, rotation and saling. Figure

13illustratestheapproah,andthegraphtotheleftshowshowhangeinpositionof

theboundingboxinreased thedistane fromthe storedtemplate 16

. The prototype

indiated that in the images we used, and at asale of

64 × 64

^pixels, ^a ^vertial ^or

horizontal translation of more than 4 pixels usually resulted in a mislassiation,

as did rotation > 6 degrees. The exat numbers are not important, but our trials

14

Thedetails oftheeigenfaebasedapproaharegiveninsetion4.2

15

Faesweredownsaledto64

×

⁶⁴^pixels^before^reognition

16

Themaximumdistaneforreognitionwassetto1.4E3,basedoninitialtrials

(31)

and the boundingbox does not have to be muh ofor the algorithm tofail.

A neural network based approah willusually indiatehits inmultiple adjaent

loations. Given the results above we therefore propose not to searh overlapping

fae hits for the single best math, but rather attempt to reognize eah box indi-

vidually. Thismay inreasereognitionrates, ashanes arethatatleastone ofthe

boundingboxes suientlywellsegmentsthe faeforthe individualtobelassied

orretly. On the other hand, false positives or poorly segmented faes will not be

reognized,and willbedisarded duringreognition.

3.4 Rowley's neural network-based approah

Figure 14: Topology of Rowley's fae de-

tetion network Although a multitude of neural net-

work based approahes to fae loal-

ization have been published, Rowley,

Baluja and Kanade are usually red-

ited with the pioneering algorithm. A

neural network-based upright frontal

fae detetion system was published

in 1998 [22℄, and later the same year

it was extended to handle faes ro-

tated in the image plane [23℄. Subse-

quentapproahesusuallyborrowoneor

more fundamentalelements of their al-

gorithm,themajorones being animage pyramid, arouternetwork and adetetion

network.

The neural networks reeive their input from a 20

×

²⁰^sliding ^window ^entered

around every pixel in the image. A 20

×

²⁰ ^sliding ^window ^has ^been ^popular ^also

invetor mahine and likelihood-basedapproahes tofaeloalization[25℄. As this

window is of xed size, the only way to detet faes at dierent sales is to san

the image at inreasingly lower resolutions. Therefore, given an input image, an

image pyramid is rst onstruted. In Rowley's implementation, this pyramid is

onstrutbysub-samplingtheoriginalimage. Ateveryloation,theslidingwindow

is histogram equalized before it feeds into the router network, whih is trained to

estimate the amount of lokwise rotation of an area. The histogram equalization

is performed to ompensates for dierenes in amera input gains, and improves

(32)

Theoretially, rather than using a router network, eah loation ould be sub-

jetedto35detetionnetworks,eahtrainedonaspeifaeorientation. However,

theneuralnetworksare fairlyomputationallydemanding,andthisapproahwould

requireamultipleof35sans omparedto therouter approah. Forthe routernet-

work, Rowley uses a fully onneted three-layer pereptron, with 400 input nodes,

15hidden nodes, and 35 output nodes. The network is trained on images of n

×

¹⁰

degrees rotation. The most ative output node represents the estimated amount

oflokwise rotation. The window an then berotated antilokwise the estimated

numberofdegrees. Abivariatelinearregressionisusedtoreduetheimpatofdire-

tionallighting. Followingthelighting orretion,thewindowishistogramequalized

one again, and presented to the detetion network. As the name implies the task

of the detetor network is to lassify a given window as a fae or non-fae. The

topologyof the detetor network is skethed in gure 14.

Itis learthat this approahisveryomputationallyintensive. A800x600 pixel

image will give rise to more than 750 000

20 × 20

^pixel ^windows ^to ^san. ^Eah

window must be equalized, rotation estimated, rotated bak, tted to a bivariate

linearregression, equalized again and run through several detetor networks. Row-

ley'srouter network onsists of 6530 network edges, so a single rotationestimation

will quikly involve more than 10000 oating point operations. As well as being

omputationallyintensive,thevastamountsofwindowssanned, usuallyonlyafew

ontainafae. Thelargenumberoftrialsandthelowbaserateanquiklytranslate

intoalotoffalsepositives. Toounteratthis, Rowleyusesthe followingstrategies;

anarbitration sheme, bootstrap training and network voting.

Their arbitration relies on simple heuristis, suh as disallowing overlapping

faes, and requiring multiple hits in the same image region. Bootstrapping in this

ontext refers to running the algorithm on images without faes, and adding the

falsepositivesto the training set. Another method of reduing the number of false

positivesis using several detetion networks, eah trained withdierentinitialran-

dom seeds to validate the presene of a fae. Howthe deisionsof these individual

lassiers in turn are treated has been the subjet of some investigationin the lit-

erature(see ex. Hjelmås [11℄). The network voting involves AND'ing the response

of two networks trainedondierent initialrandom values.

(33)

Our approahuses the main onstituent part of Rowley's algorithm,while we have

attemptedto reduerunning time. This isprimarily done through:

i) using a skin olour and image texture model to substantially redue the searh

spae.

ii) Usingan evolutionarystrategy to reduethe numberof edges inthe routernet-

work.

3.5.1 Preproessing: Colour transformationand onstruting the image

pyramid

Our program aepts unompressed TIFF [1℄ images as input. When the program

is run, the image is rst heked for olour information. If the input image is in

RGB format, itis transformed intoan HSI olour spae. The details of this olour

transformationsare given in setion2.2. The resultingimage has three bands. One

onsisting of hue (H), one of saturation (S) and the last of pixel intensities (I). In

ourimplementation,theolourinformationissolelyusedtodisardnon-skin-olour

regions. The detetion step itself is olour blind, and operates exlusively on the I

bandof theimage. WhereasRowleyusedasub-samplingapproah inonstruting

the image pyramid, we eventually settled on bilinear interpolation in downsaling

theimage. Asub-samplingapproah resulted inlargedeteriorationonhigherlevels

inthepyramidandpoorerdetetorperformane,whiletheextraproessingrequired

for the interpolation is minusule ompared tothe other omputationallyintensive

steps inthe algorithm.

3.5.2 Optimizing the searh by using olour and texture

The main limitation of a neural network based approah is that it is very ompu-

tationally demanding. Rowley's 1998 implementation used 383 seonds to san a

320

×

²⁴⁰ ^pixel ^image ^for ^vertial ^faes¹⁷^. ^With ^the ^more ^densely ^onneted ^router

network, and additional operations of rotation and equalization, the time for de-

teting rotated faes will easily be a multiple of that for vertial. In fat, it will

beshown insetion 6that ompensatingfor faerotationinreases proessingtime

in our implementation by up to 500%. If the algorithm is to be genuinely useful

intraversing large databases, the runningtime must be redued substantially. The

17

Benhmarksperformedona200MHzR4400SGIIndigo2system

(34)

arding areas of the image surfae where the presene of a fae is highly unlikely.

Withalmost all running time of the algorithmtied up in the neural networks, dis-

arding a perentage of the image results in an almost linear redution in running

time. As an beseen ingure 3.5.2, often substantialregions of the original image

an be ignored if the image is thresholded on texture and olour. While several

authors have suggested using skin olour to redue the searh spae, we have not

seen a similar argument presented for texture, although numerous papers present

texturebased methodsfor loalizing faes[19℄.

Empirialstudies havedemonstrated thatskin olour of people of dierent eth-

niity varymostly inlightintensity,andrelativelylittleinhromatiquality (under

normal lighting onditions). Skin olours were dened as an area of the HS plane,

withhue valuesrangingbetween <0,130>, saturationvaluesbetween<0,70>,(and

anintensity level >40).

Colour an be used in several ways, either by disarding non-skin olour areas,

or by lowering the detetion threshold in these areas. In our approah, we simply

disardthose areas of the image that fall outsidethe thresholds.

Ifanimageontainsasingle largefae,itwillstillbesanned onmany dierent

resolutions,mostof whih aremany levelsbelowthe one onwhihitisdeteted. In

fat, by far the most windows willbe sanned in the lowest levels of the pyramid.

At these bottom levels

20 × 20

^windows êntered ôver ân ârea ôf ^the ^fae ^will

ontainskin-olouredpixels, buthaveatexturethatlets usdisardit. Furthermore,

in graysale images, whih is still the norm in most available fae databases, we

obviously annot use skin olour. At the

20 × 20

^pixel ^levels ^at ^whih ^faes ^are

deteted, they have textures that set them apart from most non-faes, so prior to

detetiontheimageanbethresholdedontexture. Inseletingtexturestoalulate,

wemust of ourse hoose ones that are able to disriminatewell between faes and

non-faes. However, as up to a million windows must be evaluated for eah image,

goodtexture andidatesmust alsobefasttoalulate. Basedempirialevaluations

weseleted variane and entropy.

Varianeis here dened as:

1 n − 1

25 X

i =1 25

X

i =1

X _i − X ¯ ) ²

⁽⁵⁾

This property is used beause under normal lighting onditions faes have areas

with lower light intensities (eyes, mouth, shadow under nose et.), and better lit

(35)

Figure 16: Texture

mask

Figure17: Colour mask

Figure 18: Combined

mask

Implementing an automatic face recognition system

UNIVERSITETET I OSLO Institutt for informatikk