Design Principles for Isolation Kernels

(1)

Design Priniples for Isolation Kernels

∗

ÅgeKvalnes

1

DagJohansen

1

Robbert vanRenesse

2

Fred B. Shneider

2

Steen Viken Valvåg

1 1

University of Tromsø

2

CornellUniversity

TehnialReport2011-70

ComputerSieneDepartment

UniversityofTromsø

Abstrat

An operating system must ensure that no hosted servie an ause

theservielevelagreementofanothertobeviolated. Ifontrolisinom-

plete,noamountofover-provisioninganompensateforitandtherewill

inevitably be ways to irumvent poliy enforement. Still, ompeting

serviesareoftenonsolidatedonthesamemahinetoredueoperational

osts. This artile presents designpriniples for onstruting operating

systemswhereall resoureonsumption isundershedulerontrol. The

viabilityofthepriniplesservingasadesign-foundationissubstantiated

throughthe implementationofa newoperatingsystem kernel thatpro-

vides ommodity operating system abstrations. Using this kernel, the

eayofthepriniplesisexperimentallyorroborated.

1 Introdution

Appliationservieprovidersandhostedserviestypiallyrunserviesonshared

mahinestoredueoperationalosts[2,3,14,15,21,56,63℄.Theperformaneof

oneservieis then vulnerable to loadsurges in other servies. So, aprovider

mightviolateservie-levelagreements(slas),leadingtolostustomersormon-

etarypenalties[43,44℄.

Theonventional approahto meetingperformaneguaranteeshasbeento

quantify softwareand hardwarerequirementsmetiulouslyandthen toimpose

admission ontrol and resoure reservation. This works well if loads an be

antiipated. But hosted servies typially are subjet to unpreditable load

surgesandtime-varyingresouredemands. Toaommodatesuhhighvariane

byusingreservationsauseshardwareutilizationtosuer.

∗

Thiswork was supported inpart bythe Researh Counil of Norway as a Center for

Researh-basedInnovation(iAD),ONRgrantsN00014-01-1-0968N00014-09-1-0652,AFOSR

grantF9550-11-1-0137, AFRL,NSFgrants0430161, 0964409, CNS-0828923, CNS-1040689

(Nebula),andCCF-0424422(TRUST),andagiftfromMirosoftCorporation.

(2)

another to be violated is alled an isolation kernel [51℄. The kernel typially

providesinstrumentationforattributingresoureusagetoindividualhostedser-

viesandemploysshedulersthatusethisusageinformationforenforingslas.

pu,memory,diskaess,andnetworkbandwidthareamongtheresouresthat

mustbesheduledtoignoreanyrisksviolatingansla. Forexample,ansla

guaranteeingsomespeiedleveloflesystemthroughputanbeviolatedwhen

there is insuient pu time to handle le i/o, insuient memoryto buer

ledata,orinsuientdisk bandwidthto readorwritelebloks.

This artile presents a new isolation kernel, Vortex. Vortex implements

ne-grainedaountingand sheduling of systemresoures. It denes an ab-

strationforenapsulatingresoures,asystemstruturethatallowsresouresto

besheduledindividually orin aoordinated fashion, andaommon interfae

toresoure-usageaountingandattribution.

Threedesignpriniplesservedasafoundationforthedesign:

(1) Measure all resoure onsumption. If hosted servies an onsume re-

soures whose usageis not measured, then resouresharingpoliies an

be irumvented. Consumption of resoures is, to the extent possible,

attributedbyVortextothehostedserviemakingthedemands 1

.

(2) Identify the unitto be sheduled with the unit of attribution. Considera

workerthread handlingasynhronousi/o requests onbehalf of multiple

hosted servies (an approah used in Windows). If this worker thread

istheunit beingsheduled,thentheshedulerhasnoontroloverwhih

i/orequestsarehandled,evenifresoureonsumptionouldberetrospe-

tivelyattributedtotheorrespondinghostedservie(s).Betterontrolan

beahieved by diretlyshedulingtheindividual i/o requestsinsteadof

the worker thread. That is, a one-to-one orrespondene is established

betweentheunit ofshedulingandtheunitofattribution.

(3) Employ ne-grainedsheduling. Thisallowslesserrorin attribution and

inreasesopportunitiesforsharing.

Therestofthisartileisorganizedasfollows.InSetion2weoutlinethekey

elementsoftheVortexarhitetureanddisussimpliationsofourthreedesign

priniples. Setion 3 givesa detailed exposition of important elementsin our

implementationofVortexonthex86platform. Setion4presentsanevaluation

of theimplementation, using dierentbenhmark appliations to determine if

our Vortex implementation instantiates our design priniples. Related work

appearsinSetion 5,andSetion 6oerssomeonlusions.

1

Someresoure onsumptionishard toattributeat the timeof onsumptionand must

beattributedaposteriori. Examplesinlude:putimedevotedtoproessinginterruptsand

demultiplexinginomingnetworkpakets.

(3)

pathrequests. sheduler.

() Resouresorganizedinagridwithshedulersontheom-

muniationpath.

Figure 1: Summaryof keyarhitetureelements.

2 Kernel arhiteture

2.1 Arhitetureoverview

Figure 1 depits the key elements of the Vortex arhiteture. Eah resoure

orresponds to a ne-grained software omponent, exporting an interfae for

aess to and use of hardware or software, suh as an i/o devie, a network

protoollayer,oralayerin alesystem.

Higher-levelkernelabstrationsand funtionalityare implemented byon-

guringresouresintoaresouregrid,whereresouresexhangeresourerequest

messages. A resourerequest message speies parametersand afuntion to

invokeattheinterfaeofthedestinationresoure. Theserviingofarequestis

asynhronoustothesendingresoure.

Shedulers may be interpositioned between resoures. Requests reeived

bya sheduler may be buered and/or dispathed to aresourein any order

onsistentwithinter-requestdependenies.

Toaountfor resoureonsumption, exeutionin response to arequestis

monitored. Themonitoring is performed external to aresoure, using instru-

mentation ode that measures pu and memory onsumption to exeute the

request,perhapsdeterminingthosevaluesretrospetively. Aftereahrequestis

exeuted,itsresoureonsumptionisreportedtothedispathingsheduler.

All resourerequests speify an ativity to whih resoure onsumption is

attributed. Ifaresouresendsrequest

r ₂

^as^part^of^handling^request

r ₁

^,^then^the

ativity of

r ₂

^is ^inherited^from

r ₁

^. Computations involvingmultiple resoures anthusbeidentiedasbelongingtooneativity. Anativityanbeaproess,

aolletionofproesses,orsomeproessingwithinasingleproess.

(4)

tions,where anativitytypiallyisassoiatedwithaproess.

2.2 Measure all resoure onsumption

Thepuonsumption inurredbyadisk deviedriverto handlearequestfor

reading

10

setorsonadiskistypiallythesameaswouldbeneededforarequest toread

20

setors. Butmemoryusagediersforthese tworequests. Moreover, the atual elapsed time for exeuting the two requests will vary, depending

onthe ontents ofdisk ontrollerahe,the position ofdisk heads,rotational

position,et. Thus,adiskisanexampleofaresourethat,foreetiveontrol,

requires a sheduler with aess to information that is not easily aptured in

software,butouldbepreditedbysoftware. Forexample,theontentsof the

diskontrollerahemightnotbeaessiblebutanbeestimatedbyknowledge

ofitssizeandobservationsofhowlongittakestoompleterequests.

Togiveshedulers aess to hiddeninformation, Vortex usesresoure on-

sumptionreords. These areextensible data struturesdesribingtheresoure

onsumptioninurredbyexeutingaresourerequest. Fieldsonerningbasi

resoureonsumption are set by Vortex instrumentation ode, and additional

eldsareattahedbyinstrumentationodeinsidetheresoureitself. Forexam-

ple,reordsdesribingresoureonsumptionwhenexeutingadiskreadrequest

ouldinlude pu and memory usagealong withadditional information: how

longittook to omplete therequest, andthe size ofthe queueof pending re-

questsatthediskontroller. Thisadditionalinformationwouldbesuppliedby

instrumentationoderunninginthediskdriver.

Measurement and attribution of resoure onsumption are separate tasks.

Measurement is always retrospetive whereas attribution may or may not be

knownin advaneoftherequestproessing. Forexample,when areadrequest

is submitted to a disk driver, the ativity to attribute is typially known in

advane, but resoureonsumption mightnot be available until after request

exeutionompletes. Anotherexampleisinterruptproessingorearlynetwork

paket proessing, where the ativity to attribute is diult to dedue until

proessingompletes. If resoureuse must be predited,then asheduler an

useheuristisbasedonhistorytoestimateresoureonsumption.

If attribution annot be determined, for exampleif an ativity annot be

assoiated with some network paket proessing, slas might beviolated. No

amountofinstrumentation,sheduling,orover-provisioning,anensurethatan

slawillbesatisedinthefaeofunantiipatedload.Theimpliationisthatan

isolationkernelimplementationmustmakeassumptionsabouttheenvironment.

2.3 Identify the unit to be sheduled with the unit of at-

tribution

Ourarhiteturerequiresshedulerstoontrolexeutionofindividualrequests,

whereeahrequestspeiesatmostoneativityforattributionofresoureon-

(5)

sumption . Notie, however,that even ifeah request is identied with some

ativity,thenattributionambiguityremainspossible. Consideraleblokahe

thatoptimizesmemoryutilizationbysharingidential lebloksarossativi-

ties. Iftwoativitiesaessthesameleblok,thentheresoureonsumption

inurredby fething andahing theblokould oneivablybe attributed to

eitherativity. Theshedulershouldthereforebeawareofthesharing. Inpra-

tie,this isaomplishedbyreordingresoureonsumptionreordsprodued

whena le blok is fethed and ahed, andhavingthese reords available to

shedulers.

Timelyexeutionofarequestmustbeensured,andsharinganauseom-

pliationshere. Consideraleblokrequestmadewhen anidential leblok

is alreadysheduled forfeth to satisfysome other ativity. i/o utilization is

improvedbydelayingthisseondfethrequestuntilthefethfortherstom-

pletes. But,depending onthesheduler, thepending feth ouldbesheduled

soonerifperformedin ontext oftherequesting ativity. So, timelyexeution

requires knowledge of a seond request, and using priority inheritane teh-

niques [57℄. Our poliies for attribution and sheduling must aommodate

suhnuane.

2.4 Employ ne-grained sheduling

Ashedulermightnotbeabletopreditwhatresoureonsumptionwillresult

fromashedulingdeision. Forexample,aleistypiallyimplementedusinga

leblokahe,lesystemode,avolumemanager,and adevie driverlayer.

Eah employs ahing, and a le system request ould traverse all or only a

subsetof thelayers. A sheduler isunlikelyto knowin advanewhat layersa

lerequestwilltraversenorwhatisahedatthetimearequestismade. Thus,

onsideringlerequestsastheunit ofshedulingmightentangleresouresthat

asheduler would wantto ontrol separately. Forexample,a shedulermight

wanttoontrolrequeststotheleblokahebasedonmemoryonsumption,

whereastheamountofdatatransferredmightbeadesirablemetriat thedisk

driverlevel. Todisentangleresoureonsumption, theVortexkernelis divided

intomanyne-grainedresouresthatanbeontrolled separately.

An inreased number of resouresimplies a orresponding inrease in the

numberofrequeststhat haveto be sheduled. Thisinreasesshedulingover-

head. Toredueoverhead,ourarhitetureexeutesallrequeststoompletion.

Oneasheduler dispathesarequestto aresoure,theproessing ofthat re-

questisneverpreempted. Theabseneofpreemptionimpliesthatrequestsan

bedispathedwithlittleoverhead.

Our arhiteture expets resoures to handle onurrent exeution of re-

quests, as needed on a multi-ore mahine. Consequently, resouresuse syn-

hronizationmehanismsto protettheirsharedstate. Abseneofpreemption

simpliesthingsonsiderably. A systemthatdidhavesupportforpreemption

2

Hardwarerestritionsmightlimitashedulertoontrollingexeutionofanaggregateof

requests. For example,thehardware mightnotsupportidentifyingativitieswithseparate

interruptvetors.

(6)

ofrequestexeutionwouldhavetoreleaseloksbeforereturningontrolto the

shedulerorriskdeadloksduetopriorityinversion[57℄. So,ashedulerinsuh

asystemwouldhavetomakeallowanesforinreasedrequestexeutiontimein

theaseofontestedloks. Vortexshedulersneednotbeonernedwithsuh

ompliations.

3 Kernel implementation

3.1 Sheduler toolkit

Vortexemploysatoolkitthatenapsulatesandautomatestasksommonaross

shedulers. The toolkit provides implementations for aggregation of request

messages,inter-shedulerommuniation,managementofresoureonsumption

reords,resourenaming,andinter-ore/puommuniationandmanagement.

The toolkit provides request queuesas ontainers for requests that require

a spei resoure, as illustrated in Figure 2. Whenever a resoure sends a

request,the toolkitloatesanexisting request queueorreatesanewone,on

whihtherequestwillbequeued. Asheduleranread,reorder,andmodifythe

queue. A typialsenarioarises with disk requests, where the order in whih

requestsareforwardedtothediskisre-orderedtoreduediskheadmovement.

Dependenies among requests are speied by assigning dependeny labels

to requests. Shedulers ensure that requests with the same dependeny label

areexeuted in the ordermade. Requests belongingto dierent ativitiesare

always onsidered independent, as are requests sent from dierent resoures.

Assuh,aresourean generatedependenylabelsbyusingasimpleounter,

whih isonatenatedwiththesending-resoureidentierandtheidentierof

theativitytoattribute.

Eah request is represented using a data struture ontaining: the desti-

nationresoure, thesending resoure, theativity to attribute, a dependeny

label, an anity label, and a desription of whih funtion to invoke in the

destinationresoure(alongwithparametersto thatfuntion).

Figure 3illustrates thedierentstepsinvolvedfromwhen arequest issent

(7)

request. request.

Figure3: Stepswhensendingandexeutingarequest.

until it is exeuted in thereeiving resoure. Sending a request follows three

stepsinFigure3(a)where(1)theshedulerassoiatedwiththequeueisnotied,

(2) the request is queued, and (3) the sheduler is given an opportunity to

requestputimefromapumultiplexorbeforeontrolisreturnedbaktothe

sendingresoure.

Then, asdepited in Figure 3(b), exeutionof arequest follows four steps

where(1)thepumultiplexordeidestoallotputimetoapartiularresoure,

(2)thegoverningshedulerisonsultedforadeisionasto whatrequest(s)to

dispathtotheresoure,(3)theseletedrequest(s)aredispathedandexeuted

toompletion,and(4)resoureonsumptionreordsaremadeavailableto the

governingsheduleratsome,possiblylater,point.

A sheduler anbeonguredto request resouresfrom another sheduler

instead of from a pu multiplexor. This provides a means to ontrol other

shared resoures. For example, i/o devies are typially attahed to a host

omputerthroughani/obusthat anbesharedwith otheri/odevies. This

bus may, in turn, be part of a hierarhy of shared buses, terminating at an

interfaeto main memory. If theaggregateapaity of onnetedi/o devies

exeedsthe apaityof the bus hierarhy,then the apaity ofany singlei/o

deviewillvarydependingonurrentbusload. Utilizingtheabilitytoongure

shedulerstorequestresouresfromanothersheduler,ani/obussheduleran

beintroduedwithouttheneedtomanifestthei/obussesaspreedingresoures

intheresouregrid.

Moredetailsonshedulerimplementationanbefoundin theAppendix.

3.1.1 Sheduling multi-orearhitetures

In a multi-ore system, one pu multiplexor is assigned to eah ore. Eah

multiplexorontrolshowtheoreissheduled. Toeientlyexploitmulti-ore

arhitetures,ertainsets ofrequestsare best exeutedonthesameoreoron

oresthat an eiently ommuniate. Forexample, weimproveahe hits if

requeststhat resultin aessto the samedatastrutures are exeutedon the

(8)

sameore.

Toonveyinformationaboutdataloality,resouresattahanitylabelsto

requests. Anitylabelsgivehintsaboutpu multiplexorpreferenes; ifapu

multiplexorreentlyhasexeutedarequestwithapartiularanitylabel,new

requestswiththesameanitylabelshouldpreferablybeexeutedbythesame

pumultiplexor.

The toolkit onsults the sheduler preeding a resoure to obtain a pu

multiplexorbindingforananitylabel. Thereturnedbindingisahedbythe

toolkit until an expiration speied by the sheduler; until expiration, subse-

quent requests with the same anity label are exeuted by the seleted pu

multiplexor. Thetoolkitensuresthat(1)requestsareonlyexeutedbythepu

multiplexorseletedbythegoverningsheduler,(2)putimeisonlyrequested

fromseletedpumultiplexors,and(3)apumultiplexoronlydequeueseligible

requests.

Figure 4 illustrates a sheduler requesting pu time from four pu multi-

plexors. One way to instantiate this onguration is to allow sheduler and

queuestate to beaessed onurrently by allfour pu multiplexorson both

requestqueueanddequeuepaths. Thisdesignriskssynhronizationbottleneks

and exessive inter-ore exhanges of sheduler and queue state. Tomitigate

thisrisk,thetoolkitalwaysinstantiatesmulti-oreongurationswithseparate

requestqueuesperore,asillustratedinFigure5. Inaddition,thetoolkitpro-

motes asheduler struture that separatesshared andore-spei state. For

example, a round-robin sheduler would maintain per-ore state about regis-

tered lients (i.e. request queues) along with a shared ounter for reating a

pumultiplexorbinding. Similarly,aweightedfairqueueing(wfq) [18℄shed-

uler would maintain per-ore stateabout lients but rely on a more omplex

strategyfor deidinghowanity labelsare bound topu multiplexors 3

. Un-

3

Ourwfqimplementationinspetsper-orestatetodeidewhihpumultiplexorshould

handlean anitylabel; oneload sharingalgorithm thatwehave implemented assigns the

label tothe ore at whihthe orresponding ativityhas proportionally reeived the least

resoures.

(9)

der this struture, sharing typially only ours when requests are sent from

oneoreand queued forexeution onanother, and when asheduler inspets

sharedstatetoseletapu multiplexorforananitylabel.

Withseparaterequestqueuesperore,exeution-orderonstraintsimposed

bydependenylabelsaretrikytosatisfy. Ifrequestswiththesamedependeny

label are queued to dierent pu multiplexors, then load imbalane among

the pu multiplexors ould result in violating exeution order dependenies.

This is preventedin Vortexby requiring resouresto assign the same anity

labeltodependentrequests,ausingdependentrequeststo havethesamepu

multiplexorbinding,henebeplaedinthesamerequestqueue.

Anotherompliation,whihishandledbythetoolkit,isexpirationofapu

multiplexorbinding. Ifabindingexpireswhiletherearequeuedrequests,then

thetoolkitwill, inoneatomiation,obtainanewbindingfromthegoverning

sheduler, move aeted requests to a potentially new queue, and update its

pumultiplexorbindingahe.

3.1.2 Sheduler onguration

Aongurationleprovidesthetoolkitwith informationitneeds forinstanti-

atingshedulersin aresouregrid. Theongurationledesribesthetypeof

shedulertouseateahresoure,aswellasdesribingongurationparameters.

Theproessofinstantiatingthese shedulersisfullyautomated: at boottime,

thetoolkitreadstheongurationleandinstantiatesshedulers.

The toolkit maintains a repository of all available shedulers. Shedulers

in thisrepositoryare ompiledaspartofthekernel. Eah sheduler isnamed

aordingtothetypeofalgorithmitimplements. Forexample,ourwfqshed-

ulerfallsintotheategoryproportionalshareshedulersandis,assuh,named

propshare.wfq.Thenameofashedulerisusedinaongurationletospeify

(10)

<

?xmlversion="1.0"?

>

<

shedulerong

<

/algorithm

>

<

/pumultiplexor

>

<

pumultiplexortag ="pumux1"

>

<

ore

>

1

<

/ore

>

^!

−−

^Resoure ^shedulers

−− >

<

resouresheduler

>

<

resoure

>

resoure.tp

<

/resoure

>

<

^algorithm

>

propshare.round

−

^robin

<

^/algorithm

<

/resouresheduler

>

<

resouresheduler

>

<

^resoure

>

/pumultiplexor

>

<

/resouresheduler

>

<

/shedulerong

>

Figure6: Exerptfromashedulerongurationle.

thepartiularshedulertoassoiatewitharesoure.

Figure 6 ontains exerpts from a onguration le, where a round-robin

sheduler is seleted for the tp Resoure and a strit-priority sheduler is

seletedfortheThreadResoure 4

. Thetpshedulerisongured torequest

pu time from both pu multiplexor

0

and pu multiplexor

1

; the Thread Resoureonly requests pu time from pu multiplexor

0

. The onguration of Figure 6is an exampleof an asymmetri onguration, i.e. a onguration

whereresouresareonguredto useonlysubsetsoftheavailable ores. Suh

ongurationsarefullysupportedbythetoolkit. Thisallowsdeploymentswith

someoresdediatedtoresoures,wheresalingthroughne-grainedlokingor

avoidaneofshareddatastruturesisdiult. Typialexamplesareresoures

that govern i/o devies using memory-based data strutures to speify dma

operations.

4

TheThreadResoureprovidesathreadabstrationforproesses.

(11)

Thetoolkitdoesnotanalyzesheduleromposition,soaongurationmay

ontainaws. For example,ifaresoureissheduledusinganearliestdeadline

rst[41℄ algorithmand pu timeis requestedfrom apu multiplexorusing a

wfqalgorithm,thentheresouresheduleranmakenoreal-timeassumptions

about deadlines. Reasoning about orretness requires a formalization of the

behavior of eah sheduler, and then an analysis of the interation between

behaviors. See[22,25,37,40,54,55℄forworkinthisdiretion.

3.2 Virtual memory management

TheVortexvirtualmemorymanagement(vmm)arhitetureisdepitedinFig-

ure 7. The Address Spae Resoure (asr) implements logi for onstruting

and maintainingpage tables and also provides aninterfae for alloating and

ontrollingtranslations for regions of anaddress spae. asr is used by other

resourestoexportandmakedataobjetsaessibleinaproessaddressspae.

Forexample,theExeutableResoure(er)usestheasrinterfaetoexportthe

segmentsofanexeutablele(text,data, bss,et.) intothepertinent regions

oftheaddressspae.

Pagefaultsarediretedtotheasr. Tohandleoneofthese,asrdetermines

whether thefaulting addressis in a regionalloated by someresoureand, if

so,sendsarequestfordatatotheresoureresponsibleforthataddress. When

reeivingsuh a request, resoures are requiredto respond with data already

ahedin theresoure,byalloatingmemoryfrom thememorymultiplexoror

by retrieving the data from other resoures. For er, further ommuniation

withtheFileCaheResoure(fr)istypiallyperformedtoretrievedatafrom

theexeutablele.

TheSwapResoure(sr)providesaninterfaeforpreservingobjetsonse-

ondarystorage. Resouresusesrwheneverrelaimedmemoryontainsobjets

noteasily reonstrutedfrom other soures. Forexample, textanbe re-read

fromanexeutable,butmodiedheapand bssmemorymustbepreservedfor

(12)

Relaimingmemory

Whetheradditionalmemoryisneededwhenexeutingarequestisdiultfor

thesendingresouretodeterminewithoutaesstostatethatisinternaltothe

reeiving resoure. For example, the reeiving resoure might use ahing to

speedup request proessing. Therefore, resoures alloate memory from the

memory multiplexor when needed, typially as part of exeuting a request.

Availablememorybeinglowortheorrespondingativityexeedingitsmemory

budget,auses thememory multiplexorto rejet analloation. Insuhases,

memoryrelamationationsmust be initiatedto ensureeventualexeutionof

theoriginalrequest.

The memorymultiplexor deideswhat physial memory to relaim. A re-

souremustbepreparedtorelinquishreferenesto alloatedmemoryuponre-

eivingmemoryrelamationrequestsfromthememorymultiplexor. Forvoiding

referenesto thephysialmemoryspeiedin arelamationrequest,resoures

are required to determine what that memory is used for. To maintain this

orrespondene,thememorymultiplexorinterfaeallowsresourestoassoiate

ookies with memory alloations. An assoiated ookie is returnedwith eah

memoryrelamationrequest;thisookieaidsinloatingreferenestothemem-

orybeingrelaimed. Forexample,whenfralloatesmemoryforaleblok,a

referenetotheleservesastheookie. Thatway,ifthememoryisrelaimed,

thentheookieenablesthefrtoupdateitsinternaldatastrutures.

Ourimplementationassoiatesaseparateativitywitheahproess,so the

relamationpoliy of the memory multiplexor dierentiates among proesses.

Byinspeting alloationrequests,the memorymultiplexorandeterminehow

muh memoryeahresoureonsumesonbehalfof apartiularativity. Still,

makingrelamationdeisionsonduiveto improvedperformanetypiallyre-

quiresadditional information. Forexample, iffrequently usedmemory in the

proess heap is relaimed then performane will erode. Likewise, relaiming

proesstextmemorywillresultin poorperformane.

To obtain needed additional information, the memory multiplexor relies

on resoureinstrumentation, to produe resoure information reords. These

reordsprovide memoryusage statistisand other pertinentinformation. For

example,asrregularlyolletsthemodiedandaessbitsstoredbypageta-

bles. Similarly,asrinformsthememorymultiplexorwhethermemoryhasbeen

modied.

Theatofrelaimingmemorymightrequireupdatesinresouresotherthan

theonethat initially alloated thememory. Forexample, er relies onfr to

ahesegmentsoftheexeutablele. Moreover,erusesasrinordertoinsert

pagetabletranslationsforthosesegments.Hene,memoryforahingsegments

isinitiallyalloatedforfr,butreferenestothataheultimatelyexistinboth

thefrandtheasr. Inordertorelaimthismemory,updatesinasrandfr

are needed. The memory multiplexor oers an interfae for this. Using the

interfae,asrausesthememorymultiplexortodiretrelamationrequeststo

(13)

page table updates and forwards the request to the resoure responsible for

theorrespondingregion. In the aseof exeutablesegments,er will in turn

performitsinternalbookkeepingandthenforwardtherequesttothefr.

Assoiating a singleativity with all vmm-relatedrequests from aproess

does not prohibit asheduler from treating various types of proess requests

dierently. We have implemented shedulers for fr that reorder and delay

queuesaordingto thesending resoure;thisallowsVortexto favordemand-

pagingtraoverregulari/otrafromaproess. Itreduesthetimebefore

memoryis freed for reuse and also theduration aproess is bloked awaiting

arrivalofpagesnotpresent.

3.3 I/O

Vorteximplementsthe posix asynhronous i/o interfae. This interfae sup-

ports asynhronoustransferofdatabetweenbuersin aproessaddressspae

andakernelsupportedi/oresoure. Eahi/ooperationisdesribedbyadata

struturethatspeiesadesriptoronwhihtheoperationistobeperformed,a

pointertoadatabuer,andsomeindiationofhowtheallingproess/thread

shouldbenotiedonetheoperationterminates.

3.3.1 Asynhronous I/O

The posix asynhronous i/o interfae is largely implemented by the asyn-

hronousi/oresoure(aior). aiorabstratseah i/ooperationintermsofa

soureresourethatproduesdataandasinkresourethatonsumesdata. The

soureorresponds to theproviderof data for aregionin theproess address

spae in the ase of writes, and it orresponds to any i/o resourefor reads.

Thesink isanalogous.Theaiororhestratesdataowfrom soureto sink.

aiorrequestsdatafromasoureresourebysendingitareadrequest. The

soureinturnrespondswitharead_donerequestontainingthetargetdata.

A similar protool is used when interating with sink resoures. aior writes

datato asinkby sendingawrite requestto it, andthesink signalsthat the

datahasbeenonsumedbysendingawrite_donerequestbak. Souresand

sinksmayuseotherresourestosatisfyareadorwriterequestortointerat

withahardwaredevie.

i/ooperationsanexeuteonurrently. Prefethingandoverlappingintro-

dueorderingonstraintsamongrequestsbelongingtothesamei/ooperation,

beausedatamustarriveatasinkintheordersentbyasoure. aiorsolvesthis

problembyassigningthesamedependenylabeltoallrequestsderivedfromthe

samei/ooperation. Thus, multi-ore parallelizationoursat thegranularity

of i/ooperations.

Similar to Vortex' vmm system, aior sets the ativity binding of derived

requeststotherequestingproess. Byinheritane,allotherrequestsgenerated

aspartofthei/ooperationwill thenpointtothesameproess.

(14)

Interruptsare integralto the operationof many i/o devies. A resourethat

operatessuhani/odeviemustregisterwiththeInterruptResouretoreeive

interruptsoriginatingfromthedevie. Interruptsareinitiallyapturedbyalow-

level Interrupt Resoure handler, whih reates and sends a resoure request

desribingtheinterrupttotheappropriateresoure.

Resoureonsumption forinterruptsis attributed retrospetively. For the

low-levelhandler,instrumentationodereatesresourereordsto returnpu

timetoanyinterruptedativity. Similarly,instrumentationodeintheresoure

reeivingthe interrupt request produesresoure reordsfor retrospetiveat-

tribution,iftheausing ativityanbededued.

3.4 The proess, system alls, and threads

A resoure may export routines in its interfae that should be aessible not

only to other resoures but also to proesses. Suh funtions are exposed as

Vortex system alls. The resoure programmer ahievesexposure by using a

stub generation faility that, for eah funtion, reates a stub for onverting

asystem all into a resourerequest message sent to theresoure. The stub

alsodeouplessystemall argumentsfromanyproess-dependentontext. For

example, the stub translates virtual memory pointers to their orresponding

physial memory pointers, ausing page faults ifneessary to bring data pre-

servedbytheSwapResoureintophysialmemory. Refereneountingensures

thephysialmemorypointersarevalidforthedurationoftheall 5

.

System all messages from a proess originate from the Proess Resoure

(pr). The pr implements the onventional proess abstration,using asr to

handleaddressspaeoperations. Toimplementproessexeutionontexts,the

pr uses theThread Resoure(tr). tr provides an interfaefor onventional

threadoperations,suh asreate,exit,suspend,resume,join,et.

tr drivesexeution of threads by using resourerequest messages. When

athread enters the ready state, aresoure request is sent to tr, leading the

trshedulerto request pu time from apu multiplexor. When therequest

isdispathed,trloatestheontrolblokoftheorrespondingthread,setsup

atimeslie timer, and ativates the thread. After ativation, the thread runs

untilthetimeslieexpiresorablokingation isperformed. Whilethethread

isrunning,thepumultiplexorregardstrasexeutingrequests. (Preemption-

interruptsaredelivereddiretlyfromthelow-levelInterruptResourehandler,

sinesubjetingthesetoshedulingwouldrequireinvolvementofthepumul-

tiplexor.)

Only the proess address-spaeand system-all stubs are addressable to a

thread. Consequently,athreadannotsubvertashedulerbydiretly invoking

afuntion inaresoureinterfae.

Turningsystemallsintorequestsinreasesoverheadbutimprovessheduler

ontrol. Forexample,adiretly-invokedfuntion oulderodeshedulerontrol

5

Conurrentrelamationofmemoryisdelayeduntiltheallompletes.

(15)

dispathed requests. Yet, in some ases, exeuting a funtion does not in-

terfere with sheduler ontrol. Examples inlude alls suh as getpid() and

gettimeofday() and funtions in the tr interfae. To aommodate these

ases,theresoureprogrammerisallowedto onstrutstubsthat diretly all

funtionsintheresoureinterfae.

Diret invoation of funtions in a resoureould allowone servie to in-

terfere with others. For example, in Vortex, we primarily use interproessor

interrupts(ipis)todispathworkthat requiresimmediate exeutiononaspe-

iore. In an earlyimplementation, weused ipis to perform operationson

threads hosted by remote ores. This deision, however, enabled athread to

disruptworkbeing performed onall oresin the systemby spawning aseries

ofthreadoperations. Theurrentimplementationusestheipimehanismonly

when the target thread is running on a remote ore; otherwise, a request is

insteadsentto trresoure.

3.5 Resoure implementation

Kernel-levelprogrammingwithinVortexamountsto implementingresourere-

quest message-handlers and resoure shedulers. A typial message handler

mightreply to a request orsend arequest to another resoure. The fr, for

example,doesboth: itmayrespondwithadiskblokfromitsaheoritmay

sendarequestto alesystemresoure.

Toassist the kernel-programmer, Vortexoers support for several onur-

renyandontinuationmodelsforhandlingrequests.

Per-resoure bloking: Here,aresouremay temporarily suspend de-

liveryofrequests,whihthenaumulate attheiroriginalrequestqueues. Un-

blokinganbedonebyanotherresoureorbydeliveryofaninterruptrequest.

Thisstrutureisusefulforimplementingdriversfori/odevies,whoseapaity

maybeoasionallyexeededbytheowofrequests.

Per-request bloking: Whenonly somerequestsrequirebloking,per-

requestblokingismoreappropriate. Consider,forexample,aFileCaheRe-

sourethatontainssomeoftherequesteddiskbloksbutnotothers,requiring

a feth from a le system resoure. To support suh situations, the toolkit

introdues apending queue. Whena resoureneedsto blok aninoming re-

questuntilitreeivesareplytoitsoutgoingrequest,theresoureanplaethe

inomingrequest intothe pending queue andattah atrigger tothe outgoing

request. Triggerspointtooneormorerequestsinthependingqueue. Resoures

arerequiredtoinludethetriggerintheirreplytoarequest,sothetoolkitan

unblokthereferenedrequest automatially whenthereply arrives. Multiple

requests an be assoiated with the same trigger, allowing multiple requests

fromthesameativitytobeunblokedsimultaneously.

Expliit ontinuations: In resoures with several potential bloking

points,per-requestblokingmayauseredundantre-exeutionofodeafterun-

bloking(sine exeutionalwaysstartsat thebeginning). Forexample, in the

Vortexext2lesystemresoure,arequestmayhavetobeblokedthreetimes,

(16)

Tohelpavoidsuhredundantre-exeution,oursystemallowsblokedrequests

toarryapointertoahandlerroutinethatresumesexeutionafterunbloking.

Cooperative threading: When a resoure uses expliit ontinuations

with alarge numberof bloking points, the ode is split into many funtions

withoutalearontrol owbetweenthem. Cooperativethreading allowspro-

grammerstouseblokingoperationsinresouresbysavingandreoveringthe

statebehind thesenes. To useit, aresourewould typially spawn for eah

request a separate thread, whih would exeute for aslong as the request is

beingproessed.

4 Evaluation

Vortexis implementedin Cand, exludingdevie drivers, omprisesapproxi-

mately

70000

linesofode. Thesystemrunsonx86-64multi-orearhitetures.

Thequestions wehopedtoanswerinourevaluationofVortexwere:

1. Isallresoureonsumptionauratelymeasured?

2. Isresoureonsumption attributedtotheorretativity?

3. Does the arhiteture permit suient ontrol for shedulers to isolate

ompeting ativities?

Inallexperiments,VortexwasrunonaDell PowerEdgeM600bladeserver

withtwoIntelXeonE5430Quad-Coreproessors.Coresrunat 2.66GHz,have

separate 64x8 way 32KB data and instrution ahes, and, in pairs, share a

6MB 64x24 way ahe (for a total of 4 suh ahes). Eah proessor has a

1333MHz front-side bus and is onneted to 16GB of DDR-2 main memory

runningat 667MHz. Through itsPCIe x8 interfae, the serverwasequipped

with two 1Gbit Broadom 5708Snetwork ards. And, to the integrated LSI

SASMegaRAID ontroller,two146GBSeagate10K.2diskswereattahedand

setupinaraid0(striped)onguration.

Togenerate load,weused a luster of blade serversrunningLinux 2.6.18.

Thesewereofthesametypeandhardwareongurationastheserverrunning

Vortex,andtheywereonnetedto theVortexserverthroughadediated HP

ProCurve4208Gigabitswith.

4.1 Measurement tehnique

Usingasystemallinterfae,aproessanobtaindataonitsownperformane

and,subjettoongurableaessrights,theperformaneofotherproessesin

thesystem. These performane dataare obtainedfrom shedulers throughan

interfaethattheyarerequiredtosupport(showninTable3oftheAppendix).

For eah lient of a sheduler, the data inludes attributed pu and memory

onsumption and, if used, onsumption as attributed by the sheduler using

otherperformanemetris.

(17)

proessonVortex. Thisproesswasgrantedfullaesstoallperformanedata

in the system and exported this data upon request using tp. External to

Vortex, a sript ommuniated with the proess, olleting samples one per

seond. The size of eah sample was around 100KB; whenever possible, the

sriptaessedanetworkinterfaeard notativelyused inanexperiment.

When a proess performs a system all to obtain performane measure-

ments, Vortex returns measurements timestamped with the urrent value of

thepu timestampounterregisterofore

0

. These timestampsorrelatepu measurements with elapsed time; disrepanies reveal unattributed pu on-

sumption. Retrospetive attribution ompliates things. Some samples indi-

ateunder-attributionwhileothersindiateover-attribution,ifthereisongoing

resoure-onsumptionwhenthesamplesareobtained. Dataauray,however,

isbounded bytheonsumptioninurredbyproessingonerequestmessage.

Mostmessagesanbeproessedbythepuin afewmiroseonds,ausing

aurayto bein thesameorder. Thread-readymessages,however,may lead

to several milliseonds of uninterrupted pu onsumption. The auray of

performanedatapertainingthreadsandtheoverallpu-time onsumptionon

oresthat runthreadsdependsuponhoieofthread timeslies. Forexample,

withthreadtimesliessetto

5

milliseonds,theexpetedaurayis

± 0.5%

for individualsamples. Weveried that ourmeasurementsare in agreementwith

expetedauraybyperformingaseriesofexperimentswithaproessrunning

onepu-boundthreadperoreandvaryingthedurationoftimeslies. Inthese,

wefoundnosamplestobeoutsideexpetedauray.

Individualsamplesmaybeinaurate,butunder-attributioninonesampleis

ompensatedforinthenextsample. Thus,foraseriesofonseutivesamples,a

deviationbetweenresoureavailabilityandattributionlargerthantheexpeted

aurayofanindividualsampleindiatesthatsomeonsumptionisnotbeing

properlyaountedfor. Intheaforementionedexperiments,omparingthesum

of elapsed to the sum of attributed yles shows the number of unaounted

ylestobewithin theexpeted aurayofindividual samples. Forexample,

in one experiment, over

100

seonds, a total of

86, 028, 592

yles were not aountedfor(

0.004%

ofelapsedyles). Thiswaswithintheexpetedauray ofanindividual sample(

± 106, 400, 000

yles).

Duringanexperiment,weensuredthattheonlyproessesrunningonVortex

were those involved in the experiment itself. We ran eah experiment

10 20

timesto verify thepreisionofperformanedata; deviations were foundto be

within the auray of individual samples. For larity, we therefore do not

inludeerrorbarsingures. Also,foreaseofvisualinterpretation,somegures

wereproduedusingGnuplotwiththedgrid3dommand 6

.

6

Indgrid3dmode,griddatapointsrepresentweightedaveragesofsurroundingdatapoints,

withloserpointsweightedhigherthandistantpoints.

(18)

cpuhog50 cpuhog33 cpuhog17

0 10 20 Time (seconds) 30 40 50 60 70 80 90 100

0 1 2 3 4 5 6 7 Core # 15

20 25 30 35 40 45 50 55

% CPU utilization

Figure 8: pu utilization runningthree pu-bound proesseswith

50%

,

33%

, and

17%

puentitlementandpumultiplexorsonguredwithwfqshedulers.

4.2 Attributing CPU onsumption

Toevaluatewhetherpuonsumptionisbeingattributedtotheorretativity,

weondutedanexperimentinvolvingthreepu-boundproesses. Eahproess

ranonepu-bound thread perore. Reall from Setion 3.4 that threads are

implemented by the Thread Resoure (tr). The tr drives the exeution of

threadsbyproessingtherequestmessagessentto itwhenathread enters the

ready state. Proessing a message involves setting up a timeslie timer and

dispathingtheorrespondingthread. Toisolateproesses,Vortexreatesone

trinstane perproess. Eah trinstane operateswithaseparate sheduler

thatmanagesthreadsbelonging toaorrespondingproess 7

.

In the experiment, pu multiplexors use a weighted fair queueing (wfq)

sheduler and assign weights to tr instanes of the proesses aording to a

50%

,

33%

, and

17%

entitlement. For the tr shedulers, we used a simple round-robinshedulerwithaloadsharingalgorithmtherebyensuringthatpro-

essthreads runon separateores (i.e.pu multiplexorbindings withinnite

durationandinitialbindingalwaysassignedto theorewiththeleastnumber

ofthreads bound to it). Figure 8illustrates theresultingpu utilization: the

pumultiplexorwfqsheduleroneahoreallotsputimeto trshedulers,

whih in turn exeute proess threads, in strit aordane with the desired

50%

,

33%

,and

17%

entitlement.

7

Thisavoidssenarios where,for example,a proess reateslotsof threadsinorderto

inreaseshedulingoverheadforotherproesses.

(19)

4.3 Attribution and isolation under ompetition

Theprevious experimentdoesnotestablish whether pu onsumption is or-

retlyattributed when aresourereeivesrequestsfrom multiple independent

ativities.

To evaluate attribution-auraywhen a resoureproesses requests from

independent ativities, weonduted an experimentwith three proessesper-

forming le reads. The proesses eah ran one thread per ore, with threads

programmedto onseutively open a designated le, read 32KB of data, and

thenlosethele. Toperformaread,threeresouresareinvolved 8

(inaddition

to the tr instanes): the Address Spae Resoure (asr), Asynhronous i/o

Resoure(aior),andtheFileCaheResoure(fr).

Due to the few les involved, the experiment is pu-bound. And sine

threadsawaittheompletionofonereadoperationbefore performinganother,

throughputisdependentontheamountofpuavailabletothethreadsandthe

threeresouresinvolved.

Intheexperiment,weonguredaresouregrid,asillustratedinFigure 9,

withseparatewfqshedulers fortheasr, aior,andfrresoures. pu on-

sumptionwasused asa metri. pu multiplexors hadwfq shedulers, where

shares gave the three resouresa minimum of

50%

of pu resoures (shared equallyamongthemselves). Theremainingpuresoureswereassignedtopro-

essesaordingto a

50%

,

33%

, and

17%

entitlement. The sameentitlement wasused fortheproessesattheasr,aior,andfrshedulers.

Figure 10 shows pu utilization at the dierent resoures involved in the

experiment. Wesee that thebulk ofpu onsumption isby the threads (ap-

proximately

45 + 30 + 15 ∼ = 90%

). This isdue to howVorteximplements the 8

Aftertherstreadoperationthe targetle isahedinmemorybythefr. Thus,in

thefollowingweignoreanyotherlesystemrelatedresoures.

(20)

AIOR ASR FSR TR filehog50 TR filehog33 TR filehog17

0 10 20 Time (seconds) 30 40 50 60 70 80 90 100

0 1 2 3 4 5 6 7 Core # 0 5

10 15 20 25 30 35 40 45 50

% CPU utilization

Figure10: Breakdownofpu utilization.

posix asynhronousi/o apiVortex avoids opyoperations onthei/o path,

makingreaddataavailabletoaproessthrougharead-onlymemorymapping.

For the reeiving thread, data is opied into the buer speied in the aio

ontrolblokdesribingtheoperation.

Figure11showsabreakdownoftherelativepuutilizationattributedtothe

proessesatallresouresandthethreads. FromFigure11(a)weonludethat

pumultiplexorwfqshedulersoperateasexpeted;threadsauratelyreeive

exesspuresoures,i.e.entitledresouresnotusedbytheasr,aior, orfr,

proportionally to their

50%

,

33%

, and

17%

entitlement. The pu resoures availabletothethreadstranslateintoaorrespondingpuonsumptionatthe

asr,aior,andfrresoures,asshownin gures11(b)(d).

So,theexperimentnotonlydemonstratesthat resoureonsumptionisa-

uratelymeasuredandattributed (goal1and2ofSetion4),but alsothat the

shedulershavesuientontrolto isolateamongompetingativities(goal3

ofSetion4).

4.4 Web server workloads

Wefurtherinvestigateattributionandisolationunderompetitionbyonsider-

inganexperimentwith(1)shedulersusingmetrisotherthanputime(bytes

writtenandread),(2)resoureonsumptionthatisinherentlyunattributableat

thetimeofonsumption(paketdemultiplexingandinterruptproessing),and

(3)ani/odevieratherthanthepuasabottlenektoinreasedperformane.

The experiment also exerises a larger number of resoures and represents a

Relative CPU utilization

(d)AddressSpaeResoure.

Figure11: Breakdownofrelativepuutilization.

Web serversoftware thttpd 9

was run,with modiationsto exploit Vor-

tex's asynhronous i/o api and event multiplexing mehanisms. thttpd is

single-threadedandevent-driven. Togenerate loadtothewebservers,weran

ApaheBenh 10

onthreeseparateLinuxmahines. Oneahmahine,ApaheBe-

nh wasonguredto generaterequests forthesame

1

MB statiwebpage re- peatedly and with aonurrenylevel of

16

. Priorto theexperiment, testing revealed ApaheBenh ould saturate a

1

Gbit network interfae even from a singlemahine. ThethreeLinuxmahinesouldtogethergenerateloadwellin

exessofnetworkinterfaeapaity.

Table 1 lists the Vortex resoures used by the web servers. By default,

Vortex manifests a network devie driver as tworesoures: the Devie Write

Resoure (dwr) and the Devie Interrupt Resoure (dir). In the ase of a

networkinterfaeard(ni)driver,insertionofpaketsintothetransmitringis

performedundertheauspiesofdwr.Transmit-nishedproessingandremoval

ofreeivedpaketsfromthereeiveringishandledbydir.

dirreeivedpakets,intheformofrequestmessages,aresenttotheNetwork

DevieReadResoure(ndrr)fordemultiplexing. Byinspetingpaketheaders,

ndrrdetermineswhetherapaketisdestinedforanopentponnetion,isa

9

http://www.ame.om/software/thttpd/thttpd.html

10

http://www.apahe.org/

(22)

Resoure Desription

DevieInterruptResoure(DIR) NICinterruptproessing

DevieWriteResoure(DWR) Insertpakets intoNICtxring

NetworkDevie WriteResoure(NDWR) Insertethernetheaderintopaket

NetworkDevie ReadResoure(NDRR) Demultiplexinomingpakets

TCPResoure(TCPR) ProessTCPpakets

TCPTimerResoure(TCPTMR) ProessTCPtimers

Asynhronousi/o Resoure(AIOR) Orhestrateasynhronousi/o

FileCaheResoure(FCR) Fileahing

AddressSpaeResoure(ASR) Addressspaemappings

synpakettargetingaonnetionin thelistenstate,orisapaketthatshould

be dropped. Ifa tponnetion is found, then thepaket is sent to thetp

Resoure(tpr) forfurtherproessing. Notethat proessingbybothdirand

ndrr is onsidered infrastruture; the ativity to attribute is determined by

ndrraspartofdemultiplexing. Alsonotethatthereisnoseparateipresoure.

Sine ip ode is used only in onjuntion with reating tp or udp paket

headers,ipisaesseddiretlyinsteadofmanifestedasaresoure.

AsdesribedinSetion3.1.1,resouresassignrequestanitylabelstogive

shedulershintsaboutpumultiplexorpreferenes,andtheyassigndependeny

labelstoontrolrequest-proessingorder. Whenapaketisremovedfrom the

nireeivering, ananityand dependenylabel areassigned to therequest.

ndrr and tpr both aess elds in the paket header and the tp ontrol

blok. Sofor performanereasons,paketsbelongingto thesametponne-

tionideallywouldbeproessed onthe sameore. tprproessing ofpakets

in ni-dequeue orderis notarequirementfororretnessbut anprevent un-

neessarytp ommuniation. Forexample, thedefault poliy fortp when

reeiving out-of-orderpakets is to reply with an ak paket (whih, in turn,

mighttriggerfast retransmit). Also, the Vortextp stak ontains theusual

fast-pathoptimizationsforin-orderpaketproessing.

Topreservepaketordering,paketsfrom thesametponnetionareas-

signed the same dependeny label at intermediate resoures. Reall that the

sheduler toolkit only guarantees ordering between asending and areeiving

resoure. Toensure thatpaketsare proessedon thesameore, identialde-

pendenylabelsareassignedarossallintermediateresoures.

For inomingpakets,thedir determinesdependenylabels byinspeting

paketheadersandomputingahashofthesendingandreeivingipaddresses

andtp ports. The omputedlabel,whih isidential for allpakets belong-

ingto thesame tp onnetion,is inherited by all intermediate resoures. If

paketproessingreatesanewtponnetion,thenthatlabelisstoredinthe

tp ontrol blok and attahed to any paket sent. Thedependeny label is

(23)

omputedaordinglyforonnetionsreatedbyproessesrunningonVortex.

In the experiment, we ongured pu multiplexors with wfq shedulers.

Resoures at eah pu multiplexor were ongured with a

50%

entitlement (shared equally among themselves), with the remaining apaity split among

webserversaordingto a

50%

,

33%

,and

17%

formula. Sinethe webservers aresingle-threaded, theyonlydrawpu resouresfrom oneore. Topromote

ompetition, weongured tr shedulers with a load sharingalgorithm that

seletedthesamepu multiplexor forallthreads (ore

7

). Theresouregrid, shown in Figure12, wasongured withseparate wfqshedulers for eah re-

soure. Ateahresoureshedulerweonguredtheinfrastrutureativitywith

a

50%

entitlement,withtheremainingsplitamongthewebserversaordingto a

50%

,

33%

,and

17%

formula. Furthermore,shedulerswereonguredtouse puonsumptionasametri,exeptforthendrr,ndrw,anddwrshedulers

whih were onguredto use bytes transferred. The dwr resoureis instru-

mentedtoemitaresourereordwheneverawriteoperationisaeptedbythe

underlyingdriver(i.e.,apaketsuessfullyinsertedintothenitransmitring).

Likewise,thediremitsaresourereordwhenareadoperationompletes.

In Vortex, a resoure with insuient apaity rejets a request. Upon

rejetion,theshedulertoolkitplaestheorrespondingresoureinasuspended

stateandrequeuestherejetedrequestintheoriginatingqueue. Untilresumed,

nonewrequestsaresenttotheresoure.Fortheniinoursystem,dwrrejets

(24)

0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100

0 10 20 30 40 50 60 70 80 90 100

Throughput (Mbits/s)

Time (seconds)

Total

httpd50

httpd33

httpd17

Figure13: Byteswritten atthedwr resoure.

a request if the ni's single transmit ring is full, after whih dwr remains

suspended untildirhasperformedwrite-ompletionproessing. dwrapaity

islimitedbythespeedatwhihthenianopypaketsfromthetransmitring

tothenetwork. Moreover,sineaesstothenitransmitringisserializedby

alok,onlyasingleoreaninsertpaketsatanygiventime. Thus,onguring

thedwr to request pu from multiple pu multiplexors would only result in

exessive ontention on the ni lok and not in inreased apaity. For this

reason, we ongured the dwr sheduler to request pu only from a single

pu multiplexor (ore

6

). Even when the niis running at full apaity and the dwr is frequently suspended awaiting dir proessing, dir proessing is

likelyto overlapwithattemptsto insertpaketsinto thetransmitring. Thus,

dir proessing is best performed on the sameore as dwr to avoidni lok

ontention 11

.

Figure 13 shows how network bandwidth is shared at the dwr resoure

duringourexperiment. ThedemandforbandwidthgeneratedbyApaheBenh

is thesame for allweb servers. However,the atual bandwidth onsumedby

eahwebserverdependsonitsentitlement,aswedesired. Moreover,notethat

the total bandwidth onsumed is lose to the maximum apaity of the ni,

onrmingthat theworkloadisi/obound.

Figure14breaksdownpuutilizationarosstheinvolvedresoures. Forthis

workload,

28.3

% of available pu yles(the equivalentof

2.26

ores) is onsumed. Notsurprisingly,thebulkofpu onsumptionisbytpandresoures

11

Whendirproessing runs on a dierent ore fromthe dwr, we measured an overall

5.5%

^inreaseⁱⁿ^puonsumption. Lokprolingfurther showedthat the inreasewas all attributabletonilokontention.

(25)

downstream. Consumption of

14 . 24%

of available pu yles (the equivalent of

1 . 13

ores)anbeattributedto infrastruture. Of this,

7 . 2

% (

0 . 58

ores)is interrupt(i.e.dir)proessingandtheremainderis paket demultiplexing(i.e.

ndrr proessing). dir proessing takes plae on ore

6

; ndrr proessing is load-shared among ores due to anity label assignment. Observethat dwr

proessinghasarelativelyxedost;whennioperatesatmaximumapaity,a

relativelyonstantnumberofpaketsneedstobetransmitted(wheretheexat

numberdependsontpdynamis). Inontrast,theostofinterruptproessing

is heavily inuened by thefrequeny of interrupts,whih is bounded by the

rateat whihpaketsareremovedfrom thenitransmitring(i.e.atmostone

interruptperpaket sent). (The numberof interruptsdue to pakets reeived

has the same bound, but a ni operating at maximum transmit and reeive

apaityisnotlikelytoinreaseinterruptfrequenysinethedriverwouldo-

alesereeivewith transmit proessing. And the ni in our system does not

haveseparateinterruptvetorsfortransmitandreeive.)

Intheexperiment,oresweremeasuredtooperateatapproximately

15 ± 3

24000

inoming pakets perseond.

Thus, overhead related to removalof sent pakets from theni transmitring

is amortized over approximately

11

pakets on average. Reduing the load on ore

6

would only result in morefrequent serviing of interrupts, leading to more frequent interrupts, whih in turn inreases pu onsumption. We

experimentallyveried this feedbakeet by only runningthe dir and dwr

onore

6

. Itsloadstayedat

100

%. Theslightlyreduedper-interruptoverhead wassubsumedbytheinreasednumberofinterrupts.

Vortex requires resoures to handle onurrent exeution of requests. In

ourimplementation, we use spin-loks to preserve invariants on sharedstate.

Forthisexperiment,anaverageof

1, 770, 000

lokoperationsareperformedper seond.Themajorityprotetrequestqueueoperations. Lokprolingdidshow

some lok hotspots, indiating a need to re-visit synhronization approahes,

butoveralllokontentioninthisexperimentwasfoundtobelow(i.e.fewpu

ylesarespentbusy-waitingonloks).

Despite low lok ontention, the aggregatedoverhead oflok operationsis

signiant. Forthehardwareweareusing,obtainingandreleasingalokwhen

theoperationanbeexeutedinternallyinaore'saheinvolvesapproximately

210

puyles. Inpratie,duetotheneedforinter-oreommuniationwhen performinglokoperations,prolingshowstheaveragelokingoverheadto be

738

pu yles. In total,

22 . 2

% of onsumed pu yles are attributable to loking overhead and ontention. Inontrast,had all lokingoperationsbeen

exeutedinternallyin aore'sahe,only

6.3

% ofonsumedpuyles would

(26)

0 1 2 3 4 5 6 7 8

address_space asynchronous_io device_interrupt device_write file_cachenetwork_device_read network_device_write statistics tcp tcp_timer httpd17 httpd33 httpd50

% CPU Utilization (out of 8 cores)

httpd50 httpd33 httpd17 infrastructure

Figure 14: Breakdownofpuonsumption.

have been attributable as suh. The latter is to some extent optimisti, but

undersoresthat synhronizationisostlyinamulti-oreenvironment.

4.5 File system workloads

We ontinue by onsidering an experiment involving le i/o. Similar to the

web serverexperiment above, this experiment involves shedulers using bytes

transferredasametri,interruptproessing,andani/odevieasabottlenek

toinreased performane. Theexperimentdiers by(1) introduing aforeign

sheduler outsidediret ontrol of Vortex(the disk ontrollerrmware shed-

uler), (2) i/o devie apaitythat utuates depending onhow the devie is

aessed (i.e.whih disk setors are aessed and in what order), and (3) i/o

requestsof markedlydierentsizes 12

.

Theexperimentaldesigninvolvedthreeproessesperforminglereads.The

proesseseahranonethread perore,withthreadsprogrammedto readon-

urrently from

32

dierent,

2

MB, les. Eah le was onseutively opened, readusing

4

parallelstreamsfromnon-overlappingregions,andthenlosed. To ensurethat the experiment was disk-bound, eah le was evited from mem-

ory ahesafter it hadbeenread 13

. Eahproessthus maintainedonurrent

readoperationsfrom

256

dierentles, foratotal

768

les altogether. Before 12

Beforeoptimizationsperformedbythediskontrollerrmware, Vortexemploysan op-

timizationwherebyi/otoadjaentbloksisoalesed. Thisisanoptimizationemployedby

mostoperatingsystems.Vortexrestritstheoptimizationtorequestsbelongingtothesame

ativityandlimitstheresultingrequeststoenompasstransferofatmost

32

KBofdata.

13

Vortexsupportsne-grainedmanagementofahedles;mehanismsanreatehek-

pointsofthelesystemandevitlestateatthegranularityofindividuallesoraggregates

oflesusedbyspeiativities.

(27)

Resoure Desription

Devie InterruptResoure(DIR) Interruptproessing

DevieReadWriteResoure(DRWR) Insertreadorwriterequests

StorageDevieReadWriteResoure(SDRWR) Buertranslations

SCSIResoure(SCSIR) SCSImesssages

StorageResoure(SR) Exportdiskvolumes

EXT2Resoure(EXT2R) Ext2 lesystem

FileCaheResoure(FCR) Fileahing

Asynhronousi/oResoure(AIOR) Orhestrateasynhronousi/o

AddressSpaeResoure(ASR) Addressspaemappings

theexperimentwasstarted,anemptylesystemwasreatedondiskandles

werethenreatedandsynedtodisk. Fileswerereatedonurrentlytoavert

sequentialleblokplaementondisk 14

.

Table2liststheVortexresouresusedbytheproesses. Vortexmanifestsa

storagedeviedriverastworesoures:theDevieReadWrite Resoure(drwr)

andtheDevieInterruptResoure(dir). Insertionofdiskread/writerequests

isperformedbydrwr andrequest nishedproessingis handledby dir. The

StorageDevieReadWriteResoure(sdrwr)interfaesthestoragesystemwith

drwr. In partiular, sdrwr translates between storage-spei request and

data-buerrepresentationsandtherepresentationsthat areusedbyallVortex

devie drivers 15

. Sine the disks in our system were ssi-based, all requests

passedthroughthessiResoure(ssir)fortheappropriatessimessagere-

ationand response handling. ssir issituated upstreamof sdrwr and down-

streamoftheStorageResoure(sr). srabstratsdierenesindisktehnology

byprovidinganamingshemeandageneralblok-basedinterfaetoadiskor

diskvolume. Forexample,afterssirhasprobedtheunderlyingssitopology,

disovereddisks and raid volumes are registeredwith srasstorage volumes,

wherebyalesysteman beassoiatedwith them orrawaess anbemade

by e.g. le system reation and reovery tools. The Ext2 Resoure (ext2r)

is upstream of sr and implements the Ext2 le system on a storage volume

providedbysr. TheFileCaheResoure(fr)initiallyreeivesleoperations

andommuniateswithext2rtoretrieveandupdatelemeta-dataanddata.

Toensureaonsistentstateondisk, lesystemstypiallyrestrithowdisk

requestsanbeorderedaftersent. ext2rusesdependenylabelstosatisfyits

orderingonstraints. Requestsinvolvingbloksthatareprivatetoale(i.e.disk

bloktableanddatabloks)areassignedthesamedependenylabelbyext2r

andintermediate resoures,ausingrequests to arriveat thedisk in theorder

14

Asequential leblokplaementwouldresultinthemajorityofdiskrequeststobeof

thesamesizeduetooalesingofreadstoadjaentbloks.

15

Vortex denes ageneral request anddata-buer interfaethat alldeviedrivers must

adhereto.

(28)

sent . Notethatext2rassoiatestheoriginatingativitywiththeserequests;

externalsynhronizationprotoolsareassumedwhendierentativitiesoverlap

i/otoale. Forbloksontaininginformationpertainingtomultipleles(i.e.

inode bloks and free inode- and free-bitmap bloks), ext2r assoiates the

infrastrutureativitywithrequestsandassignsdependenylabelssimilarlyto

privatebloks. Useoftheinfrastrutureativityisneededfor onsistentstate

ondisk 17

, beausethe toolkit onlyguaranteesordering for requests belonging

tothesameativity.

Intheexperiment, pu multiplexorswere onguredwith wfqshedulers.

The resoure grid was ongured with separate wfq shedulers for eah re-

50%

,

33%

, and

17%

formula. Shedulers for resoures downstream of fr were ongured to use

bytestransferredasametri,sine,forthesetypesofresoures,puonsump-

tionisnotrepresentativeofatualresoureonsumption(seeSetion 2.2). For

thesamereasonsasthose outlined inthe webserverexperimentabove,drwr

anddir were onguredto request pu from asingleore(ore

6

). Thedisk rmwarewasonguredtohandleupto

256

onurrentrequeststoallowample opportunities forrmwaretoperformoptimizations.

Figure15showshowdiskbandwidthissharedatthedrwrresoureduring

theexperiment. Beausediskapaityvariedarossrunsdueto hangesinle

blokplaement,thegureshowsrelativebandwidthonsumptionforthethree

proesses. Thedemand for bandwidthis thesame forallthree proesses, but

asdesiredandseen,atualallotmentdependsonentitlement.

Figure 16 breaks down pu utilization arossthe involved resoures. For

thisworkload,only

0 . 99

% ofavailablepuyles(the equivalentof

0 . 08

ores) is onsumed, whih learly shows that the disk is the bottlenek to improved

performane.

5 Related work

5.1 Sheduling CPU

One Vortex objetive is to provide a exible framework for shedulers that

supports a wide varietyof poliies. Prior work has also explored support for

multiple, oexisting sheduling poliies. In ontrast to Vortex, the fous of

that work was guaranteeing pu yles for proesses. Of partiular relevane

to Vortex is work that investigates interation between shedulers organized

16

Software-basedrequestorderingtoreduediskheadmovementmightresultinadierent

disk-arrivalorder,but,similartooptimizationsperformedbydiskrmware,theorderingmust

satisfyonsistenymodels.

17

TheFileCaheResoureguaranteesthatnoreadsorwritesareinprogresswhensendinga

requesttoext2rthatinvolveslemeta-dataupdates. Thisrelievesext2rfromimplementing

logiforsynhronizingpendingreadsorwriteswithmeta-dataupdates.

(29)

15 20 25 30 35 40 45 50 55

0 10 20 30 40 50 60 70 80 90 100

Relative bandwidth consumption

Time (seconds)

fshog50

fshog33

fshog17

Figure15: Bytesreadatthedrwrresoure.

in a hierarhy, beause of the similar hierarhial relationship between pu

multiplexorsandresoureshedulers.

Goyalet al.[28℄presentoneofthersthierarhialshedulingsystemsthat

allows dierent algorithms for dierent appliations. The system uses a fair

queuing algorithm at alllevelsof the shedulinghierarhy,exept for theleaf

nodes. Leafnodesmay implement arbitraryshedulingpoliies, muh like the

Thread Resoure shedulers in Vortex. The open environment for real-time

appliations[19,20℄and bssi[40℄restritthenumberoflevelsinthehierarhy

to two, and thesesystems relyonan earliestdeadline rst(edf) sheduler at

theroottoresolvetimingonstraintsofappliationshedulers. RED-Linux[65℄

denesshedulingneedsoftasksin termsofattributes,whih maybeadjusted

toexpressdierentreal-timepoliies(edf, ratemonotoni,et.). Coneptually

thisdenesatwo-levelshedulinghierarhy.

pu inheritane sheduling[27℄ allowsonstrution ofarbitrary sheduling

hierarhiesbydesignatingertainthreadsasshedulerthreadsandotherthreads

aslientthreads. Shedulerthreadsimplementshedulingpoliiesbydonating

putimetolientthreads.Alientthreadan,inturn,atasashedulerthread

by donating its pu time to other threadsa onept originally introdued

in[17℄.puinheritaneshedulinganbeviewedasageneralizationofsheduler

ativations[1℄,onlyextendedwithpartsoftheshedulinghierarhyresidingat

kernel-level(although,theoriginalpuinheritaneworkonlydesribesauser-

levelimplementation). Nemesis[30℄,Aegis[24℄,andspin[8℄allimplementtwo-

levelshedulerhierarhieswithinterfaessimilartothatofshedulerativations.

Nemesisand Aegis require allseond-level shedulers to run at user-leveland

use axed sheduler at the root of the hierarhy; spinallowsappliations to

downloadtheirownshedulers intothekernelatrun-time.

Hierarhial loadable shedulers [55℄ (hls) and Vassal [13℄ both allow a

(30)

0 0.05 0.1 0.15 0.2 0.25

address_space asynchronous_io device_interrupt device_readwrite ext2 file_cachescsi statistics storage_device_readwrite storage_readwrite fshog17 fshog33 fshog50

% CPU Utilization (out of 8 cores)

fshog50 fshog33 fshog17 infrastructure

Figure 16: Breakdownofpuonsumption.

sheduler,downloadedintothekernelatrun-time,toontrolshedulingofavail-

ablethreads. Vassal onlyallowsa singlesheduler to o-existwith thenative

WindowsNTsheduler;hlsallowsarbitraryshedulerhierarhiesinWindows

2000. The hlsauthorsobservethat i/oativitiesseverely aetthe eetive-

nessandaurayoftheirpusheduling. Thisproblemisexpliitlyaddressed

in Vortex, beause it was designed to enfore poliies for both pu and i/o

onsumption.

5.2 Sheduling CPU and other resoures

Mostoperatingsystemshavewell-denedinterfaesforalloatingpu timeto

threadsorproesses,andtheshedulingalgorithmsmaybemodiedin arela-

tivelystraightforwardmanner. Inontrast,there isamultitudeof frameworks

andmehanismsforontrollingonsumptionofotherresoures.TheLinuxker-

nelusestimers,allouts,threads,andsubsystem-speiframeworkstodispath

workonbehalfofappliations. Asaresult,workthataimstomakeallresoure

onsumption shedulable in an existing system must overome the disparities

ofadiverseset ofmehanisms. Ifonlyertainresouresaremadeshedulable,

then inevitably there will be be ways to irumvent poliy enforement. For

example, ifonly network bandwidth is sheduled, then aweb serverould be

preluded from reahing its potential throughput by another disk-bound ap-