Design Priniples for Isolation Kernels
∗
ÅgeKvalnes
1
DagJohansen
1
Robbert vanRenesse
2
Fred B. Shneider
2
Steen Viken Valvåg
1 1
University of Tromsø
2
CornellUniversity
TehnialReport2011-70
ComputerSieneDepartment
UniversityofTromsø
Abstrat
An operating system must ensure that no hosted servie an ause
theservielevelagreementofanothertobeviolated. Ifontrolisinom-
plete,noamountofover-provisioninganompensateforitandtherewill
inevitably be ways to irumvent poliy enforement. Still, ompeting
serviesareoftenonsolidatedonthesamemahinetoredueoperational
osts. This artile presents designpriniples for onstruting operating
systemswhereall resoureonsumption isundershedulerontrol. The
viabilityofthepriniplesservingasadesign-foundationissubstantiated
throughthe implementationofa newoperatingsystem kernel thatpro-
vides ommodity operating system abstrations. Using this kernel, the
eayofthepriniplesisexperimentallyorroborated.
1 Introdution
Appliationservieprovidersandhostedserviestypiallyrunserviesonshared
mahinestoredueoperationalosts[2,3,14,15,21,56,63℄.Theperformaneof
oneservieis then vulnerable to loadsurges in other servies. So, aprovider
mightviolateservie-levelagreements(slas),leadingtolostustomersormon-
etarypenalties[43,44℄.
Theonventional approahto meetingperformaneguaranteeshasbeento
quantify softwareand hardwarerequirementsmetiulouslyandthen toimpose
admission ontrol and resoure reservation. This works well if loads an be
antiipated. But hosted servies typially are subjet to unpreditable load
surgesandtime-varyingresouredemands. Toaommodatesuhhighvariane
byusingreservationsauseshardwareutilizationtosuer.
∗
Thiswork was supported inpart bythe Researh Counil of Norway as a Center for
Researh-basedInnovation(iAD),ONRgrantsN00014-01-1-0968N00014-09-1-0652,AFOSR
grantF9550-11-1-0137, AFRL,NSFgrants0430161, 0964409, CNS-0828923, CNS-1040689
(Nebula),andCCF-0424422(TRUST),andagiftfromMirosoftCorporation.
another to be violated is alled an isolation kernel [51℄. The kernel typially
providesinstrumentationforattributingresoureusagetoindividualhostedser-
viesandemploysshedulersthatusethisusageinformationforenforingslas.
pu,memory,diskaess,andnetworkbandwidthareamongtheresouresthat
mustbesheduledtoignoreanyrisksviolatingansla. Forexample,ansla
guaranteeingsomespeiedleveloflesystemthroughputanbeviolatedwhen
there is insuient pu time to handle le i/o, insuient memoryto buer
ledata,orinsuientdisk bandwidthto readorwritelebloks.
This artile presents a new isolation kernel, Vortex. Vortex implements
ne-grainedaountingand sheduling of systemresoures. It denes an ab-
strationforenapsulatingresoures,asystemstruturethatallowsresouresto
besheduledindividually orin aoordinated fashion, andaommon interfae
toresoure-usageaountingandattribution.
Threedesignpriniplesservedasafoundationforthedesign:
(1) Measure all resoure onsumption. If hosted servies an onsume re-
soures whose usageis not measured, then resouresharingpoliies an
be irumvented. Consumption of resoures is, to the extent possible,
attributedbyVortextothehostedserviemakingthedemands 1
.
(2) Identify the unitto be sheduled with the unit of attribution. Considera
workerthread handlingasynhronousi/o requests onbehalf of multiple
hosted servies (an approah used in Windows). If this worker thread
istheunit beingsheduled,thentheshedulerhasnoontroloverwhih
i/orequestsarehandled,evenifresoureonsumptionouldberetrospe-
tivelyattributedtotheorrespondinghostedservie(s).Betterontrolan
beahieved by diretlyshedulingtheindividual i/o requestsinsteadof
the worker thread. That is, a one-to-one orrespondene is established
betweentheunit ofshedulingandtheunitofattribution.
(3) Employ ne-grainedsheduling. Thisallowslesserrorin attribution and
inreasesopportunitiesforsharing.
Therestofthisartileisorganizedasfollows.InSetion2weoutlinethekey
elementsoftheVortexarhitetureanddisussimpliationsofourthreedesign
priniples. Setion 3 givesa detailed exposition of important elementsin our
implementationofVortexonthex86platform. Setion4presentsanevaluation
of theimplementation, using dierentbenhmark appliations to determine if
our Vortex implementation instantiates our design priniples. Related work
appearsinSetion 5,andSetion 6oerssomeonlusions.
1
Someresoure onsumptionishard toattributeat the timeof onsumptionand must
beattributedaposteriori. Examplesinlude:putimedevotedtoproessinginterruptsand
demultiplexinginomingnetworkpakets.
pathrequests. sheduler.
() Resouresorganizedinagridwithshedulersontheom-
muniationpath.
Figure 1: Summaryof keyarhitetureelements.
2 Kernel arhiteture
2.1 Arhitetureoverview
Figure 1 depits the key elements of the Vortex arhiteture. Eah resoure
orresponds to a ne-grained software omponent, exporting an interfae for
aess to and use of hardware or software, suh as an i/o devie, a network
protoollayer,oralayerin alesystem.
Higher-levelkernelabstrationsand funtionalityare implemented byon-
guringresouresintoaresouregrid,whereresouresexhangeresourerequest
messages. A resourerequest message speies parametersand afuntion to
invokeattheinterfaeofthedestinationresoure. Theserviingofarequestis
asynhronoustothesendingresoure.
Shedulers may be interpositioned between resoures. Requests reeived
bya sheduler may be buered and/or dispathed to aresourein any order
onsistentwithinter-requestdependenies.
Toaountfor resoureonsumption, exeutionin response to arequestis
monitored. Themonitoring is performed external to aresoure, using instru-
mentation ode that measures pu and memory onsumption to exeute the
request,perhapsdeterminingthosevaluesretrospetively. Aftereahrequestis
exeuted,itsresoureonsumptionisreportedtothedispathingsheduler.
All resourerequests speify an ativity to whih resoure onsumption is
attributed. Ifaresouresendsrequest
r 2
aspartofhandlingrequestr 1
,thentheativity of
r 2
is inheritedfromr 1
. Computations involvingmultiple resoures anthusbeidentiedasbelongingtooneativity. Anativityanbeaproess,aolletionofproesses,orsomeproessingwithinasingleproess.
tions,where anativitytypiallyisassoiatedwithaproess.
2.2 Measure all resoure onsumption
Thepuonsumption inurredbyadisk deviedriverto handlearequestfor
reading
10
setorsonadiskistypiallythesameaswouldbeneededforarequest toread20
setors. Butmemoryusagediersforthese tworequests. Moreover, the atual elapsed time for exeuting the two requests will vary, dependingonthe ontents ofdisk ontrollerahe,the position ofdisk heads,rotational
position,et. Thus,adiskisanexampleofaresourethat,foreetiveontrol,
requires a sheduler with aess to information that is not easily aptured in
software,butouldbepreditedbysoftware. Forexample,theontentsof the
diskontrollerahemightnotbeaessiblebutanbeestimatedbyknowledge
ofitssizeandobservationsofhowlongittakestoompleterequests.
Togiveshedulers aess to hiddeninformation, Vortex usesresoure on-
sumptionreords. These areextensible data struturesdesribingtheresoure
onsumptioninurredbyexeutingaresourerequest. Fieldsonerningbasi
resoureonsumption are set by Vortex instrumentation ode, and additional
eldsareattahedbyinstrumentationodeinsidetheresoureitself. Forexam-
ple,reordsdesribingresoureonsumptionwhenexeutingadiskreadrequest
ouldinlude pu and memory usagealong withadditional information: how
longittook to omplete therequest, andthe size ofthe queueof pending re-
questsatthediskontroller. Thisadditionalinformationwouldbesuppliedby
instrumentationoderunninginthediskdriver.
Measurement and attribution of resoure onsumption are separate tasks.
Measurement is always retrospetive whereas attribution may or may not be
knownin advaneoftherequestproessing. Forexample,when areadrequest
is submitted to a disk driver, the ativity to attribute is typially known in
advane, but resoureonsumption mightnot be available until after request
exeutionompletes. Anotherexampleisinterruptproessingorearlynetwork
paket proessing, where the ativity to attribute is diult to dedue until
proessingompletes. If resoureuse must be predited,then asheduler an
useheuristisbasedonhistorytoestimateresoureonsumption.
If attribution annot be determined, for exampleif an ativity annot be
assoiated with some network paket proessing, slas might beviolated. No
amountofinstrumentation,sheduling,orover-provisioning,anensurethatan
slawillbesatisedinthefaeofunantiipatedload.Theimpliationisthatan
isolationkernelimplementationmustmakeassumptionsabouttheenvironment.
2.3 Identify the unit to be sheduled with the unit of at-
tribution
Ourarhiteturerequiresshedulerstoontrolexeutionofindividualrequests,
whereeahrequestspeiesatmostoneativityforattributionofresoureon-
sumption . Notie, however,that even ifeah request is identied with some
ativity,thenattributionambiguityremainspossible. Consideraleblokahe
thatoptimizesmemoryutilizationbysharingidential lebloksarossativi-
ties. Iftwoativitiesaessthesameleblok,thentheresoureonsumption
inurredby fething andahing theblokould oneivablybe attributed to
eitherativity. Theshedulershouldthereforebeawareofthesharing. Inpra-
tie,this isaomplishedbyreordingresoureonsumptionreordsprodued
whena le blok is fethed and ahed, andhavingthese reords available to
shedulers.
Timelyexeutionofarequestmustbeensured,andsharinganauseom-
pliationshere. Consideraleblokrequestmadewhen anidential leblok
is alreadysheduled forfeth to satisfysome other ativity. i/o utilization is
improvedbydelayingthisseondfethrequestuntilthefethfortherstom-
pletes. But,depending onthesheduler, thepending feth ouldbesheduled
soonerifperformedin ontext oftherequesting ativity. So, timelyexeution
requires knowledge of a seond request, and using priority inheritane teh-
niques [57℄. Our poliies for attribution and sheduling must aommodate
suhnuane.
2.4 Employ ne-grained sheduling
Ashedulermightnotbeabletopreditwhatresoureonsumptionwillresult
fromashedulingdeision. Forexample,aleistypiallyimplementedusinga
leblokahe,lesystemode,avolumemanager,and adevie driverlayer.
Eah employs ahing, and a le system request ould traverse all or only a
subsetof thelayers. A sheduler isunlikelyto knowin advanewhat layersa
lerequestwilltraversenorwhatisahedatthetimearequestismade. Thus,
onsideringlerequestsastheunit ofshedulingmightentangleresouresthat
asheduler would wantto ontrol separately. Forexample,a shedulermight
wanttoontrolrequeststotheleblokahebasedonmemoryonsumption,
whereastheamountofdatatransferredmightbeadesirablemetriat thedisk
driverlevel. Todisentangleresoureonsumption, theVortexkernelis divided
intomanyne-grainedresouresthatanbeontrolled separately.
An inreased number of resouresimplies a orresponding inrease in the
numberofrequeststhat haveto be sheduled. Thisinreasesshedulingover-
head. Toredueoverhead,ourarhitetureexeutesallrequeststoompletion.
Oneasheduler dispathesarequestto aresoure,theproessing ofthat re-
questisneverpreempted. Theabseneofpreemptionimpliesthatrequestsan
bedispathedwithlittleoverhead.
Our arhiteture expets resoures to handle onurrent exeution of re-
quests, as needed on a multi-ore mahine. Consequently, resouresuse syn-
hronizationmehanismsto protettheirsharedstate. Abseneofpreemption
simpliesthingsonsiderably. A systemthatdidhavesupportforpreemption
2
Hardwarerestritionsmightlimitashedulertoontrollingexeutionofanaggregateof
requests. For example,thehardware mightnotsupportidentifyingativitieswithseparate
interruptvetors.
ofrequestexeutionwouldhavetoreleaseloksbeforereturningontrolto the
shedulerorriskdeadloksduetopriorityinversion[57℄. So,ashedulerinsuh
asystemwouldhavetomakeallowanesforinreasedrequestexeutiontimein
theaseofontestedloks. Vortexshedulersneednotbeonernedwithsuh
ompliations.
3 Kernel implementation
3.1 Sheduler toolkit
Vortexemploysatoolkitthatenapsulatesandautomatestasksommonaross
shedulers. The toolkit provides implementations for aggregation of request
messages,inter-shedulerommuniation,managementofresoureonsumption
reords,resourenaming,andinter-ore/puommuniationandmanagement.
The toolkit provides request queuesas ontainers for requests that require
a spei resoure, as illustrated in Figure 2. Whenever a resoure sends a
request,the toolkitloatesanexisting request queueorreatesanewone,on
whihtherequestwillbequeued. Asheduleranread,reorder,andmodifythe
queue. A typialsenarioarises with disk requests, where the order in whih
requestsareforwardedtothediskisre-orderedtoreduediskheadmovement.
Dependenies among requests are speied by assigning dependeny labels
to requests. Shedulers ensure that requests with the same dependeny label
areexeuted in the ordermade. Requests belongingto dierent ativitiesare
always onsidered independent, as are requests sent from dierent resoures.
Assuh,aresourean generatedependenylabelsbyusingasimpleounter,
whih isonatenatedwiththesending-resoureidentierandtheidentierof
theativitytoattribute.
Eah request is represented using a data struture ontaining: the desti-
nationresoure, thesending resoure, theativity to attribute, a dependeny
label, an anity label, and a desription of whih funtion to invoke in the
destinationresoure(alongwithparametersto thatfuntion).
Figure 3illustrates thedierentstepsinvolvedfromwhen arequest issent
request. request.
Figure3: Stepswhensendingandexeutingarequest.
until it is exeuted in thereeiving resoure. Sending a request follows three
stepsinFigure3(a)where(1)theshedulerassoiatedwiththequeueisnotied,
(2) the request is queued, and (3) the sheduler is given an opportunity to
requestputimefromapumultiplexorbeforeontrolisreturnedbaktothe
sendingresoure.
Then, asdepited in Figure 3(b), exeutionof arequest follows four steps
where(1)thepumultiplexordeidestoallotputimetoapartiularresoure,
(2)thegoverningshedulerisonsultedforadeisionasto whatrequest(s)to
dispathtotheresoure,(3)theseletedrequest(s)aredispathedandexeuted
toompletion,and(4)resoureonsumptionreordsaremadeavailableto the
governingsheduleratsome,possiblylater,point.
A sheduler anbeonguredto request resouresfrom another sheduler
instead of from a pu multiplexor. This provides a means to ontrol other
shared resoures. For example, i/o devies are typially attahed to a host
omputerthroughani/obusthat anbesharedwith otheri/odevies. This
bus may, in turn, be part of a hierarhy of shared buses, terminating at an
interfaeto main memory. If theaggregateapaity of onnetedi/o devies
exeedsthe apaityof the bus hierarhy,then the apaity ofany singlei/o
deviewillvarydependingonurrentbusload. Utilizingtheabilitytoongure
shedulerstorequestresouresfromanothersheduler,ani/obussheduleran
beintroduedwithouttheneedtomanifestthei/obussesaspreedingresoures
intheresouregrid.
Moredetailsonshedulerimplementationanbefoundin theAppendix.
3.1.1 Sheduling multi-orearhitetures
In a multi-ore system, one pu multiplexor is assigned to eah ore. Eah
multiplexorontrolshowtheoreissheduled. Toeientlyexploitmulti-ore
arhitetures,ertainsets ofrequestsare best exeutedonthesameoreoron
oresthat an eiently ommuniate. Forexample, weimproveahe hits if
requeststhat resultin aessto the samedatastrutures are exeutedon the
sameore.
Toonveyinformationaboutdataloality,resouresattahanitylabelsto
requests. Anitylabelsgivehintsaboutpu multiplexorpreferenes; ifapu
multiplexorreentlyhasexeutedarequestwithapartiularanitylabel,new
requestswiththesameanitylabelshouldpreferablybeexeutedbythesame
pumultiplexor.
The toolkit onsults the sheduler preeding a resoure to obtain a pu
multiplexorbindingforananitylabel. Thereturnedbindingisahedbythe
toolkit until an expiration speied by the sheduler; until expiration, subse-
quent requests with the same anity label are exeuted by the seleted pu
multiplexor. Thetoolkitensuresthat(1)requestsareonlyexeutedbythepu
multiplexorseletedbythegoverningsheduler,(2)putimeisonlyrequested
fromseletedpumultiplexors,and(3)apumultiplexoronlydequeueseligible
requests.
Figure 4 illustrates a sheduler requesting pu time from four pu multi-
plexors. One way to instantiate this onguration is to allow sheduler and
queuestate to beaessed onurrently by allfour pu multiplexorson both
requestqueueanddequeuepaths. Thisdesignriskssynhronizationbottleneks
and exessive inter-ore exhanges of sheduler and queue state. Tomitigate
thisrisk,thetoolkitalwaysinstantiatesmulti-oreongurationswithseparate
requestqueuesperore,asillustratedinFigure5. Inaddition,thetoolkitpro-
motes asheduler struture that separatesshared andore-spei state. For
example, a round-robin sheduler would maintain per-ore state about regis-
tered lients (i.e. request queues) along with a shared ounter for reating a
pumultiplexorbinding. Similarly,aweightedfairqueueing(wfq) [18℄shed-
uler would maintain per-ore stateabout lients but rely on a more omplex
strategyfor deidinghowanity labelsare bound topu multiplexors 3
. Un-
3
Ourwfqimplementationinspetsper-orestatetodeidewhihpumultiplexorshould
handlean anitylabel; oneload sharingalgorithm thatwehave implemented assigns the
label tothe ore at whihthe orresponding ativityhas proportionally reeived the least
resoures.
der this struture, sharing typially only ours when requests are sent from
oneoreand queued forexeution onanother, and when asheduler inspets
sharedstatetoseletapu multiplexorforananitylabel.
Withseparaterequestqueuesperore,exeution-orderonstraintsimposed
bydependenylabelsaretrikytosatisfy. Ifrequestswiththesamedependeny
label are queued to dierent pu multiplexors, then load imbalane among
the pu multiplexors ould result in violating exeution order dependenies.
This is preventedin Vortexby requiring resouresto assign the same anity
labeltodependentrequests,ausingdependentrequeststo havethesamepu
multiplexorbinding,henebeplaedinthesamerequestqueue.
Anotherompliation,whihishandledbythetoolkit,isexpirationofapu
multiplexorbinding. Ifabindingexpireswhiletherearequeuedrequests,then
thetoolkitwill, inoneatomiation,obtainanewbindingfromthegoverning
sheduler, move aeted requests to a potentially new queue, and update its
pumultiplexorbindingahe.
3.1.2 Sheduler onguration
Aongurationleprovidesthetoolkitwith informationitneeds forinstanti-
atingshedulersin aresouregrid. Theongurationledesribesthetypeof
shedulertouseateahresoure,aswellasdesribingongurationparameters.
Theproessofinstantiatingthese shedulersisfullyautomated: at boottime,
thetoolkitreadstheongurationleandinstantiatesshedulers.
The toolkit maintains a repository of all available shedulers. Shedulers
in thisrepositoryare ompiledaspartofthekernel. Eah sheduler isnamed
aordingtothetypeofalgorithmitimplements. Forexample,ourwfqshed-
ulerfallsintotheategoryproportionalshareshedulersandis,assuh,named
propshare.wfq.Thenameofashedulerisusedinaongurationletospeify
<
?xmlversion="1.0"?>
<
shedulerong>
<
!−−
CPUMultiplexors−− >
<
pumultiplexortag ="pumux0">
<
ore>
0<
/ore>
<
algorithm>
propshare.wfq<
/algorithm>
<
/pumultiplexor>
<
pumultiplexortag ="pumux1">
<
ore>
1<
/ore>
<
algorithm>
propshare.wfq<
/algorithm>
10<
/pumultiplexor>
<
!−−
Resoure shedulers−− >
<
resouresheduler>
<
resoure>
resoure.tp<
/resoure>
<
algorithm>
propshare.round−
robin<
/algorithm>
<
pumultiplexor>
<
tag>
pumux0<
/tag>
<
share>
20<
/share>
<
/pumultiplexor>
20<
pumultiplexor>
<
tag>
pumux1<
/tag>
<
share>
40<
/share>
<
/pumultiplexor>
<
/resouresheduler>
<
resouresheduler>
<
resoure>
resoure.thread<
/resoure>
<
algorithm>
priority.strit<
/algorithm>
<
pumultiplexor>
<
tag>
pumux0<
/tag>
30<
share>
40<
/share>
<
/pumultiplexor>
<
/resouresheduler>
<
/shedulerong>
Figure6: Exerptfromashedulerongurationle.
thepartiularshedulertoassoiatewitharesoure.
Figure 6 ontains exerpts from a onguration le, where a round-robin
sheduler is seleted for the tp Resoure and a strit-priority sheduler is
seletedfortheThreadResoure 4
. Thetpshedulerisongured torequest
pu time from both pu multiplexor
0
and pu multiplexor1
; the Thread Resoureonly requests pu time from pu multiplexor0
. The onguration of Figure 6is an exampleof an asymmetri onguration, i.e. a ongurationwhereresouresareonguredto useonlysubsetsoftheavailable ores. Suh
ongurationsarefullysupportedbythetoolkit. Thisallowsdeploymentswith
someoresdediatedtoresoures,wheresalingthroughne-grainedlokingor
avoidaneofshareddatastruturesisdiult. Typialexamplesareresoures
that govern i/o devies using memory-based data strutures to speify dma
operations.
4
TheThreadResoureprovidesathreadabstrationforproesses.
Thetoolkitdoesnotanalyzesheduleromposition,soaongurationmay
ontainaws. For example,ifaresoureissheduledusinganearliestdeadline
rst[41℄ algorithmand pu timeis requestedfrom apu multiplexorusing a
wfqalgorithm,thentheresouresheduleranmakenoreal-timeassumptions
about deadlines. Reasoning about orretness requires a formalization of the
behavior of eah sheduler, and then an analysis of the interation between
behaviors. See[22,25,37,40,54,55℄forworkinthisdiretion.
3.2 Virtual memory management
TheVortexvirtualmemorymanagement(vmm)arhitetureisdepitedinFig-
ure 7. The Address Spae Resoure (asr) implements logi for onstruting
and maintainingpage tables and also provides aninterfae for alloating and
ontrollingtranslations for regions of anaddress spae. asr is used by other
resourestoexportandmakedataobjetsaessibleinaproessaddressspae.
Forexample,theExeutableResoure(er)usestheasrinterfaetoexportthe
segmentsofanexeutablele(text,data, bss,et.) intothepertinent regions
oftheaddressspae.
Pagefaultsarediretedtotheasr. Tohandleoneofthese,asrdetermines
whether thefaulting addressis in a regionalloated by someresoureand, if
so,sendsarequestfordatatotheresoureresponsibleforthataddress. When
reeivingsuh a request, resoures are requiredto respond with data already
ahedin theresoure,byalloatingmemoryfrom thememorymultiplexoror
by retrieving the data from other resoures. For er, further ommuniation
withtheFileCaheResoure(fr)istypiallyperformedtoretrievedatafrom
theexeutablele.
TheSwapResoure(sr)providesaninterfaeforpreservingobjetsonse-
ondarystorage. Resouresusesrwheneverrelaimedmemoryontainsobjets
noteasily reonstrutedfrom other soures. Forexample, textanbe re-read
fromanexeutable,butmodiedheapand bssmemorymustbepreservedfor
Relaimingmemory
Whetheradditionalmemoryisneededwhenexeutingarequestisdiultfor
thesendingresouretodeterminewithoutaesstostatethatisinternaltothe
reeiving resoure. For example, the reeiving resoure might use ahing to
speedup request proessing. Therefore, resoures alloate memory from the
memory multiplexor when needed, typially as part of exeuting a request.
Availablememorybeinglowortheorrespondingativityexeedingitsmemory
budget,auses thememory multiplexorto rejet analloation. Insuhases,
memoryrelamationationsmust be initiatedto ensureeventualexeutionof
theoriginalrequest.
The memorymultiplexor deideswhat physial memory to relaim. A re-
souremustbepreparedtorelinquishreferenesto alloatedmemoryuponre-
eivingmemoryrelamationrequestsfromthememorymultiplexor. Forvoiding
referenesto thephysialmemoryspeiedin arelamationrequest,resoures
are required to determine what that memory is used for. To maintain this
orrespondene,thememorymultiplexorinterfaeallowsresourestoassoiate
ookies with memory alloations. An assoiated ookie is returnedwith eah
memoryrelamationrequest;thisookieaidsinloatingreferenestothemem-
orybeingrelaimed. Forexample,whenfralloatesmemoryforaleblok,a
referenetotheleservesastheookie. Thatway,ifthememoryisrelaimed,
thentheookieenablesthefrtoupdateitsinternaldatastrutures.
Ourimplementationassoiatesaseparateativitywitheahproess,so the
relamationpoliy of the memory multiplexor dierentiates among proesses.
Byinspeting alloationrequests,the memorymultiplexorandeterminehow
muh memoryeahresoureonsumesonbehalfof apartiularativity. Still,
makingrelamationdeisionsonduiveto improvedperformanetypiallyre-
quiresadditional information. Forexample, iffrequently usedmemory in the
proess heap is relaimed then performane will erode. Likewise, relaiming
proesstextmemorywillresultin poorperformane.
To obtain needed additional information, the memory multiplexor relies
on resoureinstrumentation, to produe resoure information reords. These
reordsprovide memoryusage statistisand other pertinentinformation. For
example,asrregularlyolletsthemodiedandaessbitsstoredbypageta-
bles. Similarly,asrinformsthememorymultiplexorwhethermemoryhasbeen
modied.
Theatofrelaimingmemorymightrequireupdatesinresouresotherthan
theonethat initially alloated thememory. Forexample, er relies onfr to
ahesegmentsoftheexeutablele. Moreover,erusesasrinordertoinsert
pagetabletranslationsforthosesegments.Hene,memoryforahingsegments
isinitiallyalloatedforfr,butreferenestothataheultimatelyexistinboth
thefrandtheasr. Inordertorelaimthismemory,updatesinasrandfr
are needed. The memory multiplexor oers an interfae for this. Using the
interfae,asrausesthememorymultiplexortodiretrelamationrequeststo
page table updates and forwards the request to the resoure responsible for
theorrespondingregion. In the aseof exeutablesegments,er will in turn
performitsinternalbookkeepingandthenforwardtherequesttothefr.
Assoiating a singleativity with all vmm-relatedrequests from aproess
does not prohibit asheduler from treating various types of proess requests
dierently. We have implemented shedulers for fr that reorder and delay
queuesaordingto thesending resoure;thisallowsVortexto favordemand-
pagingtraoverregulari/otrafromaproess. Itreduesthetimebefore
memoryis freed for reuse and also theduration aproess is bloked awaiting
arrivalofpagesnotpresent.
3.3 I/O
Vorteximplementsthe posix asynhronous i/o interfae. This interfae sup-
ports asynhronoustransferofdatabetweenbuersin aproessaddressspae
andakernelsupportedi/oresoure. Eahi/ooperationisdesribedbyadata
struturethatspeiesadesriptoronwhihtheoperationistobeperformed,a
pointertoadatabuer,andsomeindiationofhowtheallingproess/thread
shouldbenotiedonetheoperationterminates.
3.3.1 Asynhronous I/O
The posix asynhronous i/o interfae is largely implemented by the asyn-
hronousi/oresoure(aior). aiorabstratseah i/ooperationintermsofa
soureresourethatproduesdataandasinkresourethatonsumesdata. The
soureorresponds to theproviderof data for aregionin theproess address
spae in the ase of writes, and it orresponds to any i/o resourefor reads.
Thesink isanalogous.Theaiororhestratesdataowfrom soureto sink.
aiorrequestsdatafromasoureresourebysendingitareadrequest. The
soureinturnrespondswitharead_donerequestontainingthetargetdata.
A similar protool is used when interating with sink resoures. aior writes
datato asinkby sendingawrite requestto it, andthesink signalsthat the
datahasbeenonsumedbysendingawrite_donerequestbak. Souresand
sinksmayuseotherresourestosatisfyareadorwriterequestortointerat
withahardwaredevie.
i/ooperationsanexeuteonurrently. Prefethingandoverlappingintro-
dueorderingonstraintsamongrequestsbelongingtothesamei/ooperation,
beausedatamustarriveatasinkintheordersentbyasoure. aiorsolvesthis
problembyassigningthesamedependenylabeltoallrequestsderivedfromthe
samei/ooperation. Thus, multi-ore parallelizationoursat thegranularity
of i/ooperations.
Similar to Vortex' vmm system, aior sets the ativity binding of derived
requeststotherequestingproess. Byinheritane,allotherrequestsgenerated
aspartofthei/ooperationwill thenpointtothesameproess.
Interruptsare integralto the operationof many i/o devies. A resourethat
operatessuhani/odeviemustregisterwiththeInterruptResouretoreeive
interruptsoriginatingfromthedevie. Interruptsareinitiallyapturedbyalow-
level Interrupt Resoure handler, whih reates and sends a resoure request
desribingtheinterrupttotheappropriateresoure.
Resoureonsumption forinterruptsis attributed retrospetively. For the
low-levelhandler,instrumentationodereatesresourereordsto returnpu
timetoanyinterruptedativity. Similarly,instrumentationodeintheresoure
reeivingthe interrupt request produesresoure reordsfor retrospetiveat-
tribution,iftheausing ativityanbededued.
3.4 The proess, system alls, and threads
A resoure may export routines in its interfae that should be aessible not
only to other resoures but also to proesses. Suh funtions are exposed as
Vortex system alls. The resoure programmer ahievesexposure by using a
stub generation faility that, for eah funtion, reates a stub for onverting
asystem all into a resourerequest message sent to theresoure. The stub
alsodeouplessystemall argumentsfromanyproess-dependentontext. For
example, the stub translates virtual memory pointers to their orresponding
physial memory pointers, ausing page faults ifneessary to bring data pre-
servedbytheSwapResoureintophysialmemory. Refereneountingensures
thephysialmemorypointersarevalidforthedurationoftheall 5
.
System all messages from a proess originate from the Proess Resoure
(pr). The pr implements the onventional proess abstration,using asr to
handleaddressspaeoperations. Toimplementproessexeutionontexts,the
pr uses theThread Resoure(tr). tr provides an interfaefor onventional
threadoperations,suh asreate,exit,suspend,resume,join,et.
tr drivesexeution of threads by using resourerequest messages. When
athread enters the ready state, aresoure request is sent to tr, leading the
trshedulerto request pu time from apu multiplexor. When therequest
isdispathed,trloatestheontrolblokoftheorrespondingthread,setsup
atimeslie timer, and ativates the thread. After ativation, the thread runs
untilthetimeslieexpiresorablokingation isperformed. Whilethethread
isrunning,thepumultiplexorregardstrasexeutingrequests. (Preemption-
interruptsaredelivereddiretlyfromthelow-levelInterruptResourehandler,
sinesubjetingthesetoshedulingwouldrequireinvolvementofthepumul-
tiplexor.)
Only the proess address-spaeand system-all stubs are addressable to a
thread. Consequently,athreadannotsubvertashedulerbydiretly invoking
afuntion inaresoureinterfae.
Turningsystemallsintorequestsinreasesoverheadbutimprovessheduler
ontrol. Forexample,adiretly-invokedfuntion oulderodeshedulerontrol
5
Conurrentrelamationofmemoryisdelayeduntiltheallompletes.
dispathed requests. Yet, in some ases, exeuting a funtion does not in-
terfere with sheduler ontrol. Examples inlude alls suh as getpid() and
gettimeofday() and funtions in the tr interfae. To aommodate these
ases,theresoureprogrammerisallowedto onstrutstubsthat diretly all
funtionsintheresoureinterfae.
Diret invoation of funtions in a resoureould allowone servie to in-
terfere with others. For example, in Vortex, we primarily use interproessor
interrupts(ipis)todispathworkthat requiresimmediate exeutiononaspe-
iore. In an earlyimplementation, weused ipis to perform operationson
threads hosted by remote ores. This deision, however, enabled athread to
disruptworkbeing performed onall oresin the systemby spawning aseries
ofthreadoperations. Theurrentimplementationusestheipimehanismonly
when the target thread is running on a remote ore; otherwise, a request is
insteadsentto trresoure.
3.5 Resoure implementation
Kernel-levelprogrammingwithinVortexamountsto implementingresourere-
quest message-handlers and resoure shedulers. A typial message handler
mightreply to a request orsend arequest to another resoure. The fr, for
example,doesboth: itmayrespondwithadiskblokfromitsaheoritmay
sendarequestto alesystemresoure.
Toassist the kernel-programmer, Vortexoers support for several onur-
renyandontinuationmodelsforhandlingrequests.
Per-resoure bloking: Here,aresouremay temporarily suspend de-
liveryofrequests,whihthenaumulate attheiroriginalrequestqueues. Un-
blokinganbedonebyanotherresoureorbydeliveryofaninterruptrequest.
Thisstrutureisusefulforimplementingdriversfori/odevies,whoseapaity
maybeoasionallyexeededbytheowofrequests.
Per-request bloking: Whenonly somerequestsrequirebloking,per-
requestblokingismoreappropriate. Consider,forexample,aFileCaheRe-
sourethatontainssomeoftherequesteddiskbloksbutnotothers,requiring
a feth from a le system resoure. To support suh situations, the toolkit
introdues apending queue. Whena resoureneedsto blok aninoming re-
questuntilitreeivesareplytoitsoutgoingrequest,theresoureanplaethe
inomingrequest intothe pending queue andattah atrigger tothe outgoing
request. Triggerspointtooneormorerequestsinthependingqueue. Resoures
arerequiredtoinludethetriggerintheirreplytoarequest,sothetoolkitan
unblokthereferenedrequest automatially whenthereply arrives. Multiple
requests an be assoiated with the same trigger, allowing multiple requests
fromthesameativitytobeunblokedsimultaneously.
Expliit ontinuations: In resoures with several potential bloking
points,per-requestblokingmayauseredundantre-exeutionofodeafterun-
bloking(sine exeutionalwaysstartsat thebeginning). Forexample, in the
Vortexext2lesystemresoure,arequestmayhavetobeblokedthreetimes,
Tohelpavoidsuhredundantre-exeution,oursystemallowsblokedrequests
toarryapointertoahandlerroutinethatresumesexeutionafterunbloking.
Cooperative threading: When a resoure uses expliit ontinuations
with alarge numberof bloking points, the ode is split into many funtions
withoutalearontrol owbetweenthem. Cooperativethreading allowspro-
grammerstouseblokingoperationsinresouresbysavingandreoveringthe
statebehind thesenes. To useit, aresourewould typially spawn for eah
request a separate thread, whih would exeute for aslong as the request is
beingproessed.
4 Evaluation
Vortexis implementedin Cand, exludingdevie drivers, omprisesapproxi-
mately
70000
linesofode. Thesystemrunsonx86-64multi-orearhitetures.Thequestions wehopedtoanswerinourevaluationofVortexwere:
1. Isallresoureonsumptionauratelymeasured?
2. Isresoureonsumption attributedtotheorretativity?
3. Does the arhiteture permit suient ontrol for shedulers to isolate
ompeting ativities?
Inallexperiments,VortexwasrunonaDell PowerEdgeM600bladeserver
withtwoIntelXeonE5430Quad-Coreproessors.Coresrunat 2.66GHz,have
separate 64x8 way 32KB data and instrution ahes, and, in pairs, share a
6MB 64x24 way ahe (for a total of 4 suh ahes). Eah proessor has a
1333MHz front-side bus and is onneted to 16GB of DDR-2 main memory
runningat 667MHz. Through itsPCIe x8 interfae, the serverwasequipped
with two 1Gbit Broadom 5708Snetwork ards. And, to the integrated LSI
SASMegaRAID ontroller,two146GBSeagate10K.2diskswereattahedand
setupinaraid0(striped)onguration.
Togenerate load,weused a luster of blade serversrunningLinux 2.6.18.
Thesewereofthesametypeandhardwareongurationastheserverrunning
Vortex,andtheywereonnetedto theVortexserverthroughadediated HP
ProCurve4208Gigabitswith.
4.1 Measurement tehnique
Usingasystemallinterfae,aproessanobtaindataonitsownperformane
and,subjettoongurableaessrights,theperformaneofotherproessesin
thesystem. These performane dataare obtainedfrom shedulers throughan
interfaethattheyarerequiredtosupport(showninTable3oftheAppendix).
For eah lient of a sheduler, the data inludes attributed pu and memory
onsumption and, if used, onsumption as attributed by the sheduler using
otherperformanemetris.
proessonVortex. Thisproesswasgrantedfullaesstoallperformanedata
in the system and exported this data upon request using tp. External to
Vortex, a sript ommuniated with the proess, olleting samples one per
seond. The size of eah sample was around 100KB; whenever possible, the
sriptaessedanetworkinterfaeard notativelyused inanexperiment.
When a proess performs a system all to obtain performane measure-
ments, Vortex returns measurements timestamped with the urrent value of
thepu timestampounterregisterofore
0
. These timestampsorrelatepu measurements with elapsed time; disrepanies reveal unattributed pu on-sumption. Retrospetive attribution ompliates things. Some samples indi-
ateunder-attributionwhileothersindiateover-attribution,ifthereisongoing
resoure-onsumptionwhenthesamplesareobtained. Dataauray,however,
isbounded bytheonsumptioninurredbyproessingonerequestmessage.
Mostmessagesanbeproessedbythepuin afewmiroseonds,ausing
aurayto bein thesameorder. Thread-readymessages,however,may lead
to several milliseonds of uninterrupted pu onsumption. The auray of
performanedatapertainingthreadsandtheoverallpu-time onsumptionon
oresthat runthreadsdependsuponhoieofthread timeslies. Forexample,
withthreadtimesliessetto
5
milliseonds,theexpetedaurayis± 0.5%
for individualsamples. Weveried that ourmeasurementsare in agreementwithexpetedauraybyperformingaseriesofexperimentswithaproessrunning
onepu-boundthreadperoreandvaryingthedurationoftimeslies. Inthese,
wefoundnosamplestobeoutsideexpetedauray.
Individualsamplesmaybeinaurate,butunder-attributioninonesampleis
ompensatedforinthenextsample. Thus,foraseriesofonseutivesamples,a
deviationbetweenresoureavailabilityandattributionlargerthantheexpeted
aurayofanindividualsampleindiatesthatsomeonsumptionisnotbeing
properlyaountedfor. Intheaforementionedexperiments,omparingthesum
of elapsed to the sum of attributed yles shows the number of unaounted
ylestobewithin theexpeted aurayofindividual samples. Forexample,
in one experiment, over
100
seonds, a total of86, 028, 592
yles were not aountedfor(0.004%
ofelapsedyles). Thiswaswithintheexpetedauray ofanindividual sample(± 106, 400, 000
yles).Duringanexperiment,weensuredthattheonlyproessesrunningonVortex
were those involved in the experiment itself. We ran eah experiment
10 20
timesto verify thepreisionofperformanedata; deviations were foundto be
within the auray of individual samples. For larity, we therefore do not
inludeerrorbarsingures. Also,foreaseofvisualinterpretation,somegures
wereproduedusingGnuplotwiththedgrid3dommand 6
.
6
Indgrid3dmode,griddatapointsrepresentweightedaveragesofsurroundingdatapoints,
withloserpointsweightedhigherthandistantpoints.
cpuhog50 cpuhog33 cpuhog17
0 10 20 Time (seconds) 30 40 50 60 70 80 90 100
0 1 2 3 4 5 6 7 Core # 15
20 25 30 35 40 45 50 55
% CPU utilization
Figure 8: pu utilization runningthree pu-bound proesseswith
50%
,33%
, and17%
puentitlementandpumultiplexorsonguredwithwfqshedulers.4.2 Attributing CPU onsumption
Toevaluatewhetherpuonsumptionisbeingattributedtotheorretativity,
weondutedanexperimentinvolvingthreepu-boundproesses. Eahproess
ranonepu-bound thread perore. Reall from Setion 3.4 that threads are
implemented by the Thread Resoure (tr). The tr drives the exeution of
threadsbyproessingtherequestmessagessentto itwhenathread enters the
ready state. Proessing a message involves setting up a timeslie timer and
dispathingtheorrespondingthread. Toisolateproesses,Vortexreatesone
trinstane perproess. Eah trinstane operateswithaseparate sheduler
thatmanagesthreadsbelonging toaorrespondingproess 7
.
In the experiment, pu multiplexors use a weighted fair queueing (wfq)
sheduler and assign weights to tr instanes of the proesses aording to a
50%
,33%
, and17%
entitlement. For the tr shedulers, we used a simple round-robinshedulerwithaloadsharingalgorithmtherebyensuringthatpro-essthreads runon separateores (i.e.pu multiplexorbindings withinnite
durationandinitialbindingalwaysassignedto theorewiththeleastnumber
ofthreads bound to it). Figure 8illustrates theresultingpu utilization: the
pumultiplexorwfqsheduleroneahoreallotsputimeto trshedulers,
whih in turn exeute proess threads, in strit aordane with the desired
50%
,33%
,and17%
entitlement.7
Thisavoidssenarios where,for example,a proess reateslotsof threadsinorderto
inreaseshedulingoverheadforotherproesses.
4.3 Attribution and isolation under ompetition
Theprevious experimentdoesnotestablish whether pu onsumption is or-
retlyattributed when aresourereeivesrequestsfrom multiple independent
ativities.
To evaluate attribution-auraywhen a resoureproesses requests from
independent ativities, weonduted an experimentwith three proessesper-
forming le reads. The proesses eah ran one thread per ore, with threads
programmedto onseutively open a designated le, read 32KB of data, and
thenlosethele. Toperformaread,threeresouresareinvolved 8
(inaddition
to the tr instanes): the Address Spae Resoure (asr), Asynhronous i/o
Resoure(aior),andtheFileCaheResoure(fr).
Due to the few les involved, the experiment is pu-bound. And sine
threadsawaittheompletionofonereadoperationbefore performinganother,
throughputisdependentontheamountofpuavailabletothethreadsandthe
threeresouresinvolved.
Intheexperiment,weonguredaresouregrid,asillustratedinFigure 9,
withseparatewfqshedulers fortheasr, aior,andfrresoures. pu on-
sumptionwasused asa metri. pu multiplexors hadwfq shedulers, where
shares gave the three resouresa minimum of
50%
of pu resoures (shared equallyamongthemselves). Theremainingpuresoureswereassignedtopro-essesaordingto a
50%
,33%
, and17%
entitlement. The sameentitlement wasused fortheproessesattheasr,aior,andfrshedulers.Figure 10 shows pu utilization at the dierent resoures involved in the
experiment. Wesee that thebulk ofpu onsumption isby the threads (ap-
proximately
45 + 30 + 15 ∼ = 90%
). This isdue to howVorteximplements the 8Aftertherstreadoperationthe targetle isahedinmemorybythefr. Thus,in
thefollowingweignoreanyotherlesystemrelatedresoures.
AIOR ASR FSR TR filehog50 TR filehog33 TR filehog17
0 10 20 Time (seconds) 30 40 50 60 70 80 90 100
0 1 2 3 4 5 6 7 Core # 0 5
10 15 20 25 30 35 40 45 50
% CPU utilization
Figure10: Breakdownofpu utilization.
posix asynhronousi/o apiVortex avoids opyoperations onthei/o path,
makingreaddataavailabletoaproessthrougharead-onlymemorymapping.
For the reeiving thread, data is opied into the buer speied in the aio
ontrolblokdesribingtheoperation.
Figure11showsabreakdownoftherelativepuutilizationattributedtothe
proessesatallresouresandthethreads. FromFigure11(a)weonludethat
pumultiplexorwfqshedulersoperateasexpeted;threadsauratelyreeive
exesspuresoures,i.e.entitledresouresnotusedbytheasr,aior, orfr,
proportionally to their
50%
,33%
, and17%
entitlement. The pu resoures availabletothethreadstranslateintoaorrespondingpuonsumptionattheasr,aior,andfrresoures,asshownin gures11(b)(d).
So,theexperimentnotonlydemonstratesthat resoureonsumptionisa-
uratelymeasuredandattributed (goal1and2ofSetion4),but alsothat the
shedulershavesuientontrolto isolateamongompetingativities(goal3
ofSetion4).
4.4 Web server workloads
Wefurtherinvestigateattributionandisolationunderompetitionbyonsider-
inganexperimentwith(1)shedulersusingmetrisotherthanputime(bytes
writtenandread),(2)resoureonsumptionthatisinherentlyunattributableat
thetimeofonsumption(paketdemultiplexingandinterruptproessing),and
(3)ani/odevieratherthanthepuasabottlenektoinreasedperformane.
The experiment also exerises a larger number of resoures and represents a
morerealistisituationthanthemiro-benhmarksdisussedabove.
filehog50 filehog33 filehog17
0 10 20 Time (seconds) 30 40 50 60 70 80 90 100 0 1 2 3 4 5 6 7 Core # 15
20 25 30 35 40 45 50 55
Relative CPU utilization
(a)ThreadResoure.
filehog50 filehog33 filehog17
0 10 20 Time (seconds) 30 40 50 60 70 80 90 100 0 1 2 3 4 5 6 7 Core # 15
20 25 30 35 40 45 50
Relative CPU utilization
(b)FileCaheResoure.
filehog50 filehog33 filehog17
0 10 20 Time (seconds) 30 40 50 60 70 80 90 100 0 1 2 3 4 5 6 7 Core # 15
20 25 30 35 40 45 50
Relative CPU utilization
() AsynhronousI/Oresoure.
filehog50 filehog33 filehog17
0 10 20 Time (seconds) 30 40 50 60 70 80 90 100 0 1 2 3 4 5 6 7 Core # 15
20 25 30 35 40 45 50 55
Relative CPU utilization
(d)AddressSpaeResoure.
Figure11: Breakdownofrelativepuutilization.
Web serversoftware thttpd 9
was run,with modiationsto exploit Vor-
tex's asynhronous i/o api and event multiplexing mehanisms. thttpd is
single-threadedandevent-driven. Togenerate loadtothewebservers,weran
ApaheBenh 10
onthreeseparateLinuxmahines. Oneahmahine,ApaheBe-
nh wasonguredto generaterequests forthesame
1
MB statiwebpage re- peatedly and with aonurrenylevel of16
. Priorto theexperiment, testing revealed ApaheBenh ould saturate a1
Gbit network interfae even from a singlemahine. ThethreeLinuxmahinesouldtogethergenerateloadwellinexessofnetworkinterfaeapaity.
Table 1 lists the Vortex resoures used by the web servers. By default,
Vortex manifests a network devie driver as tworesoures: the Devie Write
Resoure (dwr) and the Devie Interrupt Resoure (dir). In the ase of a
networkinterfaeard(ni)driver,insertionofpaketsintothetransmitringis
performedundertheauspiesofdwr.Transmit-nishedproessingandremoval
ofreeivedpaketsfromthereeiveringishandledbydir.
dirreeivedpakets,intheformofrequestmessages,aresenttotheNetwork
DevieReadResoure(ndrr)fordemultiplexing. Byinspetingpaketheaders,
ndrrdetermineswhetherapaketisdestinedforanopentponnetion,isa
9
http://www.ame.om/software/thttpd/thttpd.html
10
http://www.apahe.org/
Resoure Desription
DevieInterruptResoure(DIR) NICinterruptproessing
DevieWriteResoure(DWR) Insertpakets intoNICtxring
NetworkDevie WriteResoure(NDWR) Insertethernetheaderintopaket
NetworkDevie ReadResoure(NDRR) Demultiplexinomingpakets
TCPResoure(TCPR) ProessTCPpakets
TCPTimerResoure(TCPTMR) ProessTCPtimers
Asynhronousi/o Resoure(AIOR) Orhestrateasynhronousi/o
FileCaheResoure(FCR) Fileahing
AddressSpaeResoure(ASR) Addressspaemappings
synpakettargetingaonnetionin thelistenstate,orisapaketthatshould
be dropped. Ifa tponnetion is found, then thepaket is sent to thetp
Resoure(tpr) forfurtherproessing. Notethat proessingbybothdirand
ndrr is onsidered infrastruture; the ativity to attribute is determined by
ndrraspartofdemultiplexing. Alsonotethatthereisnoseparateipresoure.
Sine ip ode is used only in onjuntion with reating tp or udp paket
headers,ipisaesseddiretlyinsteadofmanifestedasaresoure.
AsdesribedinSetion3.1.1,resouresassignrequestanitylabelstogive
shedulershintsaboutpumultiplexorpreferenes,andtheyassigndependeny
labelstoontrolrequest-proessingorder. Whenapaketisremovedfrom the
nireeivering, ananityand dependenylabel areassigned to therequest.
ndrr and tpr both aess elds in the paket header and the tp ontrol
blok. Sofor performanereasons,paketsbelongingto thesametponne-
tionideallywouldbeproessed onthe sameore. tprproessing ofpakets
in ni-dequeue orderis notarequirementfororretnessbut anprevent un-
neessarytp ommuniation. Forexample, thedefault poliy fortp when
reeiving out-of-orderpakets is to reply with an ak paket (whih, in turn,
mighttriggerfast retransmit). Also, the Vortextp stak ontains theusual
fast-pathoptimizationsforin-orderpaketproessing.
Topreservepaketordering,paketsfrom thesametponnetionareas-
signed the same dependeny label at intermediate resoures. Reall that the
sheduler toolkit only guarantees ordering between asending and areeiving
resoure. Toensure thatpaketsare proessedon thesameore, identialde-
pendenylabelsareassignedarossallintermediateresoures.
For inomingpakets,thedir determinesdependenylabels byinspeting
paketheadersandomputingahashofthesendingandreeivingipaddresses
andtp ports. The omputedlabel,whih isidential for allpakets belong-
ingto thesame tp onnetion,is inherited by all intermediate resoures. If
paketproessingreatesanewtponnetion,thenthatlabelisstoredinthe
tp ontrol blok and attahed to any paket sent. Thedependeny label is
omputedaordinglyforonnetionsreatedbyproessesrunningonVortex.
In the experiment, we ongured pu multiplexors with wfq shedulers.
Resoures at eah pu multiplexor were ongured with a
50%
entitlement (shared equally among themselves), with the remaining apaity split amongwebserversaordingto a
50%
,33%
,and17%
formula. Sinethe webservers aresingle-threaded, theyonlydrawpu resouresfrom oneore. Topromoteompetition, weongured tr shedulers with a load sharingalgorithm that
seletedthesamepu multiplexor forallthreads (ore
7
). Theresouregrid, shown in Figure12, wasongured withseparate wfqshedulers for eah re-soure. Ateahresoureshedulerweonguredtheinfrastrutureativitywith
a
50%
entitlement,withtheremainingsplitamongthewebserversaordingto a50%
,33%
,and17%
formula. Furthermore,shedulerswereonguredtouse puonsumptionasametri,exeptforthendrr,ndrw,anddwrshedulerswhih were onguredto use bytes transferred. The dwr resoureis instru-
mentedtoemitaresourereordwheneverawriteoperationisaeptedbythe
underlyingdriver(i.e.,apaketsuessfullyinsertedintothenitransmitring).
Likewise,thediremitsaresourereordwhenareadoperationompletes.
In Vortex, a resoure with insuient apaity rejets a request. Upon
rejetion,theshedulertoolkitplaestheorrespondingresoureinasuspended
stateandrequeuestherejetedrequestintheoriginatingqueue. Untilresumed,
nonewrequestsaresenttotheresoure.Fortheniinoursystem,dwrrejets
0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100
0 10 20 30 40 50 60 70 80 90 100
Throughput (Mbits/s)
Time (seconds)
Total
httpd50
httpd33
httpd17
Figure13: Byteswritten atthedwr resoure.
a request if the ni's single transmit ring is full, after whih dwr remains
suspended untildirhasperformedwrite-ompletionproessing. dwrapaity
islimitedbythespeedatwhihthenianopypaketsfromthetransmitring
tothenetwork. Moreover,sineaesstothenitransmitringisserializedby
alok,onlyasingleoreaninsertpaketsatanygiventime. Thus,onguring
thedwr to request pu from multiple pu multiplexors would only result in
exessive ontention on the ni lok and not in inreased apaity. For this
reason, we ongured the dwr sheduler to request pu only from a single
pu multiplexor (ore
6
). Even when the niis running at full apaity and the dwr is frequently suspended awaiting dir proessing, dir proessing islikelyto overlapwithattemptsto insertpaketsinto thetransmitring. Thus,
dir proessing is best performed on the sameore as dwr to avoidni lok
ontention 11
.
Figure 13 shows how network bandwidth is shared at the dwr resoure
duringourexperiment. ThedemandforbandwidthgeneratedbyApaheBenh
is thesame for allweb servers. However,the atual bandwidth onsumedby
eahwebserverdependsonitsentitlement,aswedesired. Moreover,notethat
the total bandwidth onsumed is lose to the maximum apaity of the ni,
onrmingthat theworkloadisi/obound.
Figure14breaksdownpuutilizationarosstheinvolvedresoures. Forthis
workload,
28.3
% of available pu yles(the equivalentof2.26
ores) is on- sumed. Notsurprisingly,thebulkofpu onsumptionisbytpandresoures11
Whendirproessing runs on a dierent ore fromthe dwr, we measured an overall
5.5%
inreaseinpuonsumption. Lokprolingfurther showedthat the inreasewas all attributabletonilokontention.downstream. Consumption of
14 . 24%
of available pu yles (the equivalent of1 . 13
ores)anbeattributedto infrastruture. Of this,7 . 2
% (0 . 58
ores)is interrupt(i.e.dir)proessingandtheremainderis paket demultiplexing(i.e.ndrr proessing). dir proessing takes plae on ore
6
; ndrr proessing is load-shared among ores due to anity label assignment. Observethat dwrproessinghasarelativelyxedost;whennioperatesatmaximumapaity,a
relativelyonstantnumberofpaketsneedstobetransmitted(wheretheexat
numberdependsontpdynamis). Inontrast,theostofinterruptproessing
is heavily inuened by thefrequeny of interrupts,whih is bounded by the
rateat whihpaketsareremovedfrom thenitransmitring(i.e.atmostone
interruptperpaket sent). (The numberof interruptsdue to pakets reeived
has the same bound, but a ni operating at maximum transmit and reeive
apaityisnotlikelytoinreaseinterruptfrequenysinethedriverwouldo-
alesereeivewith transmit proessing. And the ni in our system does not
haveseparateinterruptvetorsfortransmitandreeive.)
Intheexperiment,oresweremeasuredtooperateatapproximately
15 ± 3
% utilization, whereas ore6
operated at100
%. Core6
might appear to be a bottlenek,butFigure13showsthattheniisoperatingatmaximumapaity,as desired. On ore
6
,28
% of utilization is due to dwr proessing,58
% dir proessing, and the remaining is due to other resoures. Sine the ni usesmessage-signaledinterrupts, interruptsan be deliveredwith low lateny and
ataratemathingpaket transmission. Forthisexperiment,thedirproesses
approximately
7300
interruptmessagesperseond. Inontrast, tptransmits approximately82000
pakets and reeives24000
inoming pakets perseond.Thus, overhead related to removalof sent pakets from theni transmitring
is amortized over approximately
11
pakets on average. Reduing the load on ore6
would only result in morefrequent serviing of interrupts, leading to more frequent interrupts, whih in turn inreases pu onsumption. Weexperimentallyveried this feedbakeet by only runningthe dir and dwr
onore
6
. Itsloadstayedat100
%. Theslightlyreduedper-interruptoverhead wassubsumedbytheinreasednumberofinterrupts.Vortex requires resoures to handle onurrent exeution of requests. In
ourimplementation, we use spin-loks to preserve invariants on sharedstate.
Forthisexperiment,anaverageof
1, 770, 000
lokoperationsareperformedper seond.Themajorityprotetrequestqueueoperations. Lokprolingdidshowsome lok hotspots, indiating a need to re-visit synhronization approahes,
butoveralllokontentioninthisexperimentwasfoundtobelow(i.e.fewpu
ylesarespentbusy-waitingonloks).
Despite low lok ontention, the aggregatedoverhead oflok operationsis
signiant. Forthehardwareweareusing,obtainingandreleasingalokwhen
theoperationanbeexeutedinternallyinaore'saheinvolvesapproximately
210
puyles. Inpratie,duetotheneedforinter-oreommuniationwhen performinglokoperations,prolingshowstheaveragelokingoverheadto be738
pu yles. In total,22 . 2
% of onsumed pu yles are attributable to loking overhead and ontention. Inontrast,had all lokingoperationsbeenexeutedinternallyin aore'sahe,only
6.3
% ofonsumedpuyles would0 1 2 3 4 5 6 7 8
address_space asynchronous_io device_interrupt device_write file_cachenetwork_device_read network_device_write statistics tcp tcp_timer httpd17 httpd33 httpd50
% CPU Utilization (out of 8 cores)
httpd50 httpd33 httpd17 infrastructure
Figure 14: Breakdownofpuonsumption.
have been attributable as suh. The latter is to some extent optimisti, but
undersoresthat synhronizationisostlyinamulti-oreenvironment.
4.5 File system workloads
We ontinue by onsidering an experiment involving le i/o. Similar to the
web serverexperiment above, this experiment involves shedulers using bytes
transferredasametri,interruptproessing,andani/odevieasabottlenek
toinreased performane. Theexperimentdiers by(1) introduing aforeign
sheduler outsidediret ontrol of Vortex(the disk ontrollerrmware shed-
uler), (2) i/o devie apaitythat utuates depending onhow the devie is
aessed (i.e.whih disk setors are aessed and in what order), and (3) i/o
requestsof markedlydierentsizes 12
.
Theexperimentaldesigninvolvedthreeproessesperforminglereads.The
proesseseahranonethread perore,withthreadsprogrammedto readon-
urrently from
32
dierent,2
MB, les. Eah le was onseutively opened, readusing4
parallelstreamsfromnon-overlappingregions,andthenlosed. To ensurethat the experiment was disk-bound, eah le was evited from mem-ory ahesafter it hadbeenread 13
. Eahproessthus maintainedonurrent
readoperationsfrom
256
dierentles, foratotal768
les altogether. Before 12Beforeoptimizationsperformedbythediskontrollerrmware, Vortexemploysan op-
timizationwherebyi/otoadjaentbloksisoalesed. Thisisanoptimizationemployedby
mostoperatingsystems.Vortexrestritstheoptimizationtorequestsbelongingtothesame
ativityandlimitstheresultingrequeststoenompasstransferofatmost
32
KBofdata.13
Vortexsupportsne-grainedmanagementofahedles;mehanismsanreatehek-
pointsofthelesystemandevitlestateatthegranularityofindividuallesoraggregates
oflesusedbyspeiativities.
Resoure Desription
Devie InterruptResoure(DIR) Interruptproessing
DevieReadWriteResoure(DRWR) Insertreadorwriterequests
StorageDevieReadWriteResoure(SDRWR) Buertranslations
SCSIResoure(SCSIR) SCSImesssages
StorageResoure(SR) Exportdiskvolumes
EXT2Resoure(EXT2R) Ext2 lesystem
FileCaheResoure(FCR) Fileahing
Asynhronousi/oResoure(AIOR) Orhestrateasynhronousi/o
AddressSpaeResoure(ASR) Addressspaemappings
theexperimentwasstarted,anemptylesystemwasreatedondiskandles
werethenreatedandsynedtodisk. Fileswerereatedonurrentlytoavert
sequentialleblokplaementondisk 14
.
Table2liststheVortexresouresusedbytheproesses. Vortexmanifestsa
storagedeviedriverastworesoures:theDevieReadWrite Resoure(drwr)
andtheDevieInterruptResoure(dir). Insertionofdiskread/writerequests
isperformedbydrwr andrequest nishedproessingis handledby dir. The
StorageDevieReadWriteResoure(sdrwr)interfaesthestoragesystemwith
drwr. In partiular, sdrwr translates between storage-spei request and
data-buerrepresentationsandtherepresentationsthat areusedbyallVortex
devie drivers 15
. Sine the disks in our system were ssi-based, all requests
passedthroughthessiResoure(ssir)fortheappropriatessimessagere-
ationand response handling. ssir issituated upstreamof sdrwr and down-
streamoftheStorageResoure(sr). srabstratsdierenesindisktehnology
byprovidinganamingshemeandageneralblok-basedinterfaetoadiskor
diskvolume. Forexample,afterssirhasprobedtheunderlyingssitopology,
disovereddisks and raid volumes are registeredwith srasstorage volumes,
wherebyalesysteman beassoiatedwith them orrawaess anbemade
by e.g. le system reation and reovery tools. The Ext2 Resoure (ext2r)
is upstream of sr and implements the Ext2 le system on a storage volume
providedbysr. TheFileCaheResoure(fr)initiallyreeivesleoperations
andommuniateswithext2rtoretrieveandupdatelemeta-dataanddata.
Toensureaonsistentstateondisk, lesystemstypiallyrestrithowdisk
requestsanbeorderedaftersent. ext2rusesdependenylabelstosatisfyits
orderingonstraints. Requestsinvolvingbloksthatareprivatetoale(i.e.disk
bloktableanddatabloks)areassignedthesamedependenylabelbyext2r
andintermediate resoures,ausingrequests to arriveat thedisk in theorder
14
Asequential leblokplaementwouldresultinthemajorityofdiskrequeststobeof
thesamesizeduetooalesingofreadstoadjaentbloks.
15
Vortex denes ageneral request anddata-buer interfaethat alldeviedrivers must
adhereto.
sent . Notethatext2rassoiatestheoriginatingativitywiththeserequests;
externalsynhronizationprotoolsareassumedwhendierentativitiesoverlap
i/otoale. Forbloksontaininginformationpertainingtomultipleles(i.e.
inode bloks and free inode- and free-bitmap bloks), ext2r assoiates the
infrastrutureativitywithrequestsandassignsdependenylabelssimilarlyto
privatebloks. Useoftheinfrastrutureativityisneededfor onsistentstate
ondisk 17
, beausethe toolkit onlyguaranteesordering for requests belonging
tothesameativity.
Intheexperiment, pu multiplexorswere onguredwith wfqshedulers.
The resoure grid was ongured with separate wfq shedulers for eah re-
soure. Resoureswere givena
50%
entitlementateahpumultiplexor,with theremainingapaitysplitamongtheproessesaordingtoa50%
,33%
,and17%
formula. Theinfrastruturewasgivena50%
entitlementateahresoure,with theremaining split among proessesaordingto a
50%
,33%
, and17%
formula. Shedulers for resoures downstream of fr were ongured to use
bytestransferredasametri,sine,forthesetypesofresoures,puonsump-
tionisnotrepresentativeofatualresoureonsumption(seeSetion 2.2). For
thesamereasonsasthose outlined inthe webserverexperimentabove,drwr
anddir were onguredto request pu from asingleore(ore
6
). Thedisk rmwarewasonguredtohandleupto256
onurrentrequeststoallowample opportunities forrmwaretoperformoptimizations.Figure15showshowdiskbandwidthissharedatthedrwrresoureduring
theexperiment. Beausediskapaityvariedarossrunsdueto hangesinle
blokplaement,thegureshowsrelativebandwidthonsumptionforthethree
proesses. Thedemand for bandwidthis thesame forallthree proesses, but
asdesiredandseen,atualallotmentdependsonentitlement.
Figure 16 breaks down pu utilization arossthe involved resoures. For
thisworkload,only
0 . 99
% ofavailablepuyles(the equivalentof0 . 08
ores) is onsumed, whih learly shows that the disk is the bottlenek to improvedperformane.
5 Related work
5.1 Sheduling CPU
One Vortex objetive is to provide a exible framework for shedulers that
supports a wide varietyof poliies. Prior work has also explored support for
multiple, oexisting sheduling poliies. In ontrast to Vortex, the fous of
that work was guaranteeing pu yles for proesses. Of partiular relevane
to Vortex is work that investigates interation between shedulers organized
16
Software-basedrequestorderingtoreduediskheadmovementmightresultinadierent
disk-arrivalorder,but,similartooptimizationsperformedbydiskrmware,theorderingmust
satisfyonsistenymodels.
17
TheFileCaheResoureguaranteesthatnoreadsorwritesareinprogresswhensendinga
requesttoext2rthatinvolveslemeta-dataupdates. Thisrelievesext2rfromimplementing
logiforsynhronizingpendingreadsorwriteswithmeta-dataupdates.
15 20 25 30 35 40 45 50 55
0 10 20 30 40 50 60 70 80 90 100
Relative bandwidth consumption
Time (seconds)
fshog50
fshog33
fshog17
Figure15: Bytesreadatthedrwrresoure.
in a hierarhy, beause of the similar hierarhial relationship between pu
multiplexorsandresoureshedulers.
Goyalet al.[28℄presentoneofthersthierarhialshedulingsystemsthat
allows dierent algorithms for dierent appliations. The system uses a fair
queuing algorithm at alllevelsof the shedulinghierarhy,exept for theleaf
nodes. Leafnodesmay implement arbitraryshedulingpoliies, muh like the
Thread Resoure shedulers in Vortex. The open environment for real-time
appliations[19,20℄and bssi[40℄restritthenumberoflevelsinthehierarhy
to two, and thesesystems relyonan earliestdeadline rst(edf) sheduler at
theroottoresolvetimingonstraintsofappliationshedulers. RED-Linux[65℄
denesshedulingneedsoftasksin termsofattributes,whih maybeadjusted
toexpressdierentreal-timepoliies(edf, ratemonotoni,et.). Coneptually
thisdenesatwo-levelshedulinghierarhy.
pu inheritane sheduling[27℄ allowsonstrution ofarbitrary sheduling
hierarhiesbydesignatingertainthreadsasshedulerthreadsandotherthreads
aslientthreads. Shedulerthreadsimplementshedulingpoliiesbydonating
putimetolientthreads.Alientthreadan,inturn,atasashedulerthread
by donating its pu time to other threadsa onept originally introdued
in[17℄.puinheritaneshedulinganbeviewedasageneralizationofsheduler
ativations[1℄,onlyextendedwithpartsoftheshedulinghierarhyresidingat
kernel-level(although,theoriginalpuinheritaneworkonlydesribesauser-
levelimplementation). Nemesis[30℄,Aegis[24℄,andspin[8℄allimplementtwo-
levelshedulerhierarhieswithinterfaessimilartothatofshedulerativations.
Nemesisand Aegis require allseond-level shedulers to run at user-leveland
use axed sheduler at the root of the hierarhy; spinallowsappliations to
downloadtheirownshedulers intothekernelatrun-time.
Hierarhial loadable shedulers [55℄ (hls) and Vassal [13℄ both allow a
0 0.05 0.1 0.15 0.2 0.25
address_space asynchronous_io device_interrupt device_readwrite ext2 file_cachescsi statistics storage_device_readwrite storage_readwrite fshog17 fshog33 fshog50
% CPU Utilization (out of 8 cores)
fshog50 fshog33 fshog17 infrastructure
Figure 16: Breakdownofpuonsumption.
sheduler,downloadedintothekernelatrun-time,toontrolshedulingofavail-
ablethreads. Vassal onlyallowsa singlesheduler to o-existwith thenative
WindowsNTsheduler;hlsallowsarbitraryshedulerhierarhiesinWindows
2000. The hlsauthorsobservethat i/oativitiesseverely aetthe eetive-
nessandaurayoftheirpusheduling. Thisproblemisexpliitlyaddressed
in Vortex, beause it was designed to enfore poliies for both pu and i/o
onsumption.
5.2 Sheduling CPU and other resoures
Mostoperatingsystemshavewell-denedinterfaesforalloatingpu timeto
threadsorproesses,andtheshedulingalgorithmsmaybemodiedin arela-
tivelystraightforwardmanner. Inontrast,there isamultitudeof frameworks
andmehanismsforontrollingonsumptionofotherresoures.TheLinuxker-
nelusestimers,allouts,threads,andsubsystem-speiframeworkstodispath
workonbehalfofappliations. Asaresult,workthataimstomakeallresoure
onsumption shedulable in an existing system must overome the disparities
ofadiverseset ofmehanisms. Ifonlyertainresouresaremadeshedulable,
then inevitably there will be be ways to irumvent poliy enforement. For
example, ifonly network bandwidth is sheduled, then aweb serverould be
preluded from reahing its potential throughput by another disk-bound ap-