Contents lists available atScienceDirect
Journal of Computational Physics: X
www.elsevier.com/locate/jcpx
Massively parallel implicit equal-weights particle filter for ocean drift trajectory forecasting
Håvard Heitlo Holm
a,b,∗, Martin Lilleeng Sætra
c,d, Peter Jan van Leeuwen
e,faMathematicsandCybernetics,SINTEFDigital,P.O.Box124Blindern,NO-0314Oslo,Norway
bDepartmentofMathematicalSciences,NorwegianUniversityofScienceandTechnology,NO-7491Trondheim,Norway cInformationTechnologyDepartment,NorwegianMeteorologicalInstitute,P.O.Box43Blindern,NO-0313Oslo,Norway dDepartmentofComputerScience,OsloMetropolitanUniversity,P.O.Box4St.Olavsplass,NO-0130Oslo,Norway eDepartmentofAtmosphericScience,ColoradoStateUniversity,3915W.LaporteAve.,FortCollins,CO80521,USA fDepartmentofMeteorology,UniversityofReading,EarleyGate,ReadingRB66BB,UK
a r t i c l e i n f o a b s t r a c t
Articlehistory:
Received1October2019
Receivedinrevisedform14February2020 Accepted26February2020
Availableonline4March2020
Keywords:
Dataassimilation Particlefilters GPUcomputing Shallow-watersimulation Finite-volumemethod Drifttrajectoryforecasting
Forecastingofoceandrifttrajectoriesareimportantformanyapplications,includingsearch and rescue operations, oil spill cleanup and iceberg risk mitigation. In an operational setting,forecastsofdrifttrajectoriesare producedbasedoncomputationallydemanding forecasts of three-dimensional ocean currents. Herein, weinvestigate a complementary approachforshortertimescalesbyusingtherecentlyproposedtwo-stageimplicitequal- weights particlefilter applied toasimplified ocean model.To achieve this, wepresent a new algorithmic design for a data-assimilation system in which all components – includingthemodel,modelerrors,andparticlefilter–takeadvantageofmassivelyparallel computearchitectures,suchasgraphicalprocessingunits.Fastercomputationscanenable in-situ and ad-hoc model runs for emergency management, and larger ensembles for betteruncertaintyquantification.Usingachallengingtestcasewithnear-realisticchaotic instabilities, werun data-assimilationexperiments basedonsyntheticobservations from driftingandmooredbuoys,andanalyze thetrajectoryforecastsforthedrifters.Ourresults show that even sparsedrifterobservations are sufficient tosignificantlyimprove short- term driftforecastsup totwelve hours.With equidistantmooredbuoys observingonly 0.1%ofthestatespace,theensemblegivesanaccuratedescriptionofthetruestateafter dataassimilationfollowedbyahigh-qualityprobabilisticforecast.
©2020TheAuthors.PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCC BYlicense(http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Prediction ofdrift trajectoriesinthe oceanhas manyapplicationsthat are importanttosociety andtheenvironment.
Examples includesearch andrescue operations, recovering objectslost atsea, planningof boomplacements foroil spill cleanup,andpreventingcollisionsbetweenicebergsandoffshoreinstallations.Toproducehigh-qualitydrifttrajectoryfore- casts,itisimportanttohaveagoodrepresentationofoceancurrents.Thisisnotaneasytask,asoceancurrentshavelarge naturalvariabilityandtherearetypicallyfewavailableobservations.Furthermore,thesizeofoceanlow- andhigh-pressure systems,so-callededdies,ismuchsmallerthantheiratmosphericcounterparts,anditischallengingtoplacethemcorrectly intypicalgridresolutionsusedbyoperationaloceanmodelstoday.
*
Correspondingauthorat:MathematicsandCybernetics,SINTEFDigital,P.O.Box124Blindern,NO-0314Oslo,Norway.E-mailaddresses:Havard.Heitlo.Holm@sintef.no(H.H. Holm),m.l.saetra@met.no(M.L. Sætra),peter.vanleeuwen@colostate.edu(P.J. van Leeuwen).
https://doi.org/10.1016/j.jcpx.2020.100053
2590-0552/©2020TheAuthors.PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).
The operational approach for drift trajectory prediction is to use the currents fromthe mostrecent ocean forecasts directly [1].Theseareimported fromcomputationallyexpensiveoceancirculationmodels,such asROMS [2], whichsolve the dynamicstate oftheocean inthreedimensions. Typically,alarge portionofthesimulationrun-timeisspent onthe data assimilation, which uses available real-world observations to correct the modeled ocean statesthat serve asinitial conditions forthenext forecast.Common forecastranges foroceancirculation modelsare threeto fivedays.Operational drift trajectory forecastsat the Norwegian Meteorological Institute (MET Norway) are produced by OpenDrift [1], which is an offline trajectory model that reads the ocean current forecasts to predict drift trajectories. Although OpenDrift is computationallyefficient,theoceancirculationmodelsstillrequireaccesstosupercomputers.
This paper explores the option of using a recently proposed filter method applied to a simplified ocean model for efficientdrift trajectoryforecasting.The aimisto buildadata-assimilationsystemthatcanrun efficientlyoncommodity- level desktopcomputers, andalsobe extendable tosupercomputers. We achieve this by usinga simplified ocean model and a data-assimilationmethodthat both are able to take advantage ofmassively parallelaccelerator hardware, such as the graphical processingunit (GPU). This work is not intended asa substitute ofcurrent operational systems, but asa complementary approach, in which the predicted currents may even be updated with in-situ observations, e.g., during ongoing search andrescueoperations. Furthermore,byenabling researchmodels torun onindividualdesktopandlaptop computers,researchersareabletodomorerapidprototyping.Atthesametime,thisworkwillcontributetomoreefficient simulationsalsoonsupercomputers,sinceallalgorithmsmaybeextendedtorunonmultipleGPUsandcomputenodes.
The paperisorganized asfollows:We startby highlightingour contributionsandreviewingrelatedwork relevantfor Lagrangian dataassimilationwithacceleratedparticlefilters.InSection 2,we describethedata-assimilationproblemand summarizethekey conceptsofso-calledproposal-distributionparticlefilters.We presentthesimplified oceanmodeland modelerrors inSection 3,whereasSection 4offersadetaileddescriptionofanalgorithm forrunningthechosen particle filteronthismodel.ThelattertwosectionsalsodiscusshowtheGPUisusedforefficientimplementationofthecomputa- tionallyintensivecomponents.InSection5,wepresentnumericalexperimentsofdrifttrajectoryensembleforecasts,using an identical-twinexperimentsetupdesignedtoresemblereal-worldoceancurrentsandwithconfigurationinspiredbyop- erationalsystems.Furthermore,weshowanddiscussthestatisticalvalidityoftheforecastsandexaminethecomputational performanceofthesimulations.Finally,Section6containsasummaryandconcludingremarks.
Papercontribution. Wepresentan efficientGPU-accelerateddata-assimilationsystembasedontherecentlyproposed im- plicit equal-weights particle filter (IEWPF) applied to simplified ocean models described by the rotating shallow-water equations. The data-assimilation algorithm, the numerical scheme for evolving the ocean model, and the algorithm for samplinglocallycorrelated,well-balancedrandommodelerrorsarealldesignedtotakeadvantageofmassivelyparallelar- chitectures.Thedata-assimilationsystemistailoredforobservationsoftheoceancurrentobtainedfromeitherfree-drifting buoysormooredbuoys.Weshownumericalexperimentsforassimilatingachallengingtestcasewithnear-realisticchaotic behavior, alongwithdrift trajectoryforecasts.Theresultsshowthat byassimilatingapproximately0.1%ofthestatespace, the posteriorensemble meanstronglyresemblesthe truestate intheentiredomain, thusenabling an accurate drifttra- jectory forecasts. This isalsothe first timethat the IEWPFmethodis applied tohigh-dimensional geophysical problems.
Furthermore,weshowthat theparticlefilterhaswell-behavedstatisticalproperties,andthatthecomputationalefficiency ofthedataassimilationiswell-balancedwithrespecttothemodel.Tothebestofourknowledge,thereexistsnoprevious massivelyparallelimplementationsofastate-of-the-artparticlefilterappliedtoahigh-dimensionalgeophysicalsystem.
Relatedwork. Particle filters, andmore generallySequential MonteCarlo (SMC)methods, constitute a large class ofnu- merical methods for statistical inference. It is well-known that the standard particle filter is prone to degeneracy in high-dimensional systems [3–5], and there have been several attempts at designing particle filters without this limita- tion. A few such particlefilters havebeen used onhigh-dimensional, near-realistic applications inthe geosciences.Ades and vanLeeuwen [6] use the equivalent-weightsparticle filteron a high-dimensional, simplified,ocean model basedon thebarotropicequations,showingthatitispossibletoavoidthedegeneracyprobleminhigh-dimensionalsystems,atthe cost ofa biasedestimate. Althoughthe schemeperformedwell,the biasgrowswithensemble size.Poterjoy,Sobashand Anderson [7] usealocalparticlefilterontheweatherresearchandforecastingmodel,inwhichthebootstrapparticlefilter isapplied locallytoobservations,andparticlestatesaremergedinstate spacebetweenthelocationsoftheobservations.
However, it remains problematic toglueparticles fromtheselocalupdatestogether to fullparticles that spanthe whole modeldomain.Theneededsmoothingcaneasilydestroydelicatebalances intheflow.Furthermore,theminimumsize of the localareasissetbyphysicallength-scale constraints,typically meaningthattoomanyobservationsarewithina local domaintoavoiddegeneracy.Inpractice,aminimumweightvalueisset,meaningthatnotallinformationisextractedfrom theobservations.Hence,localization isnotsolvingtheproblem.Arecentreviewby vanLeeuwenetal. [8] discussesmost recentdevelopmentsonparticlefiltersforhigh-dimensionalgeophysicalsystems.
SeveralimplementationsofstandardparticlefiltersforparallelarchitecturessuchasGPUsexist,butmainlywithinother scientific disciplines than geosciences.Lopez etal. [9] present GPU-implementations of a particle filter (with sequential importanceresampling)andauxiliaryparticlefiltertodetectanomaliesinmanufacturingprocesses,andshowsufficientper- formanceforreal-timeapplication.Gelencsér-Horváthetal. [10] introduceamodifiedcellularparticlefilterwithMetropolis resamplingon theGPUforreal-timeapplications.LibBi [11] isasoftware packageforstate-spacemodeling and Bayesian inferencecapableofutilizingGPUs.SeveralparticlefiltersareimplementedinLibBi,e.g.,particleMarkovChainMonteCarlo
(pMCMC)andSMC2.OthermethodsinLibBiincludetheExtendedKalmanFilter(EKF)andparameteroptimization routines.
BaiandHu [12] demonstrateparticlefilter-baseddataassimilationforsimulationofwildfirespread,withparallelsampling andweightcomputationbasedontheMapReduceprogrammingmodel.Inamorerecentwork,Baietal. [13] describemore efficientroutingofparticlesbetweenprocessingunitsintheresamplingstepofadistributedparticlefilter.
Otherdata-assimilationmethodshavealsobeensubjecttoGPU-accelerations.BlattnerandYang [14] giveaperformance study ofa GPU-implementation ofthelocal ensembletransformKalman filter,WeiandHuang [15] explorea GPU-based implementationoftheEKF,andQuinnandAbarbanel [16] presentageneralpathintegralMonteCarloapproachappliedto aneuronmodel.Theyallreportmassivespeed-upsontheorder100-1000overCPUimplementations.Theoreticalspeed-up basedonhardwarespecificationsforFLOPSandmemorybandwidthisontheorder10 [17].
AssimilationofLagrangiandataischallengingduetothepotentialcomplexityofthetrajectoriesandtheneedfortrans- formingthedatainto Eularianvelocity data(forfixed-gridorspectral numericalmodels).Apte, JonesandStuart [18] use particlesmoothingforassimilatingLagrangian datafromdriftersandpresentthreemethods forsamplingfromtheexact posteriorprobabilitydensityfunctionbasedontheLangevinequation andtheMetropolis-Hastings algorithm.Theirmeth- ods are showntoproduce better resultsthan theensemble Kalmanfilterusingperturbed observations. Spiller, Apte and Jones [19] usebothparticlefilteringandsmoothing(exactposteriorsampling)forassimilatingLagrangiandatafromgliders anddrifters. Theyproposea newobservationoperatorto dealwiththe highuncertainty inthelocationsofthe observa- tions.Spilleretal. [20] investigatethedivergenceofaparticlefilterforthepoint-vortexmodel.Theyintroducebacktracking particlefiltersandshowthatthefiltersoutperformEKFforthetwo-pointvortexsystem.Othermethodsthanparticlefilters andsmoothershavealsobeensuccessfullyimplemented [21–25].
2. Thedata-assimilationproblem
Therearemanypotentialsourcesforerrorsinthesimulationofatmosphericandoceanographicprocesses.Theseerrors mayarise fromphysicalprocessesmissinginthemathematicalmodel,discretizationerrorsinthenumericalmethod,sub- grid effects that can not be resolved in thediscretized model, anduncertainties in modelparameters, initial conditions, forcing andboundary conditions. Hence, we do not only wish to simulate the behavior of the unknown physical state, denoted by ψ, butrather its probability density function (pdf), p(ψ). As geophysical applications tend to be very high- dimensionalanddrivenbynonlinearprocesses,ananalyticdescriptionofp(ψ)isgenerallyunobtainable,andanensemble- based Monte-Carlo simulation is one way to measure the uncertainties in the system. In its simplest form, ensemble- basedstatisticalsimulation consistsofaset of Ne independentstate vectors{ψi}i=1,...,Ne,which areinitialized according to uncertainties in the model parameters andinitial conditions.The state of each ensemble member is then simulated independentlyaccordingtothemodelequation,
ψ
ni=
Mψ
ni−1+ β
ni−1,
forn=
1,
2, ...,
(1)inwhichthemodelMevolvesthesolutiondeterministicallyfromtimetn−1totn,andβni−1isanoptionalstochasticvariable thatrepresentsrealizationsoftheerrorsinthemodel.Thepdfofthesystemcanthenberepresentedthroughthestatistical propertiesoftheresultingensemble,e.g.,as
p
(ψ
n) =
1 NeNe
i=1
δ
ψ
n− ψ
ni,
(2)inwhichδistheDiracdeltafunction.
Ifan observation yn ofthesystemisavailable attimetn,thisinformationcanbeusedtoimprovetheobtainedproba- bilitydensity.Typically,theobservationisalsoinfluencedbyuncertainty,as
yn
=
Hψ
ntrue+
n,
(3)inwhichHistheobservationoperatorthatmapsthetruestateψntruetoobservationspaceand n isastochasticobservation error.Theobservationstypicallyonlycoverpartsofthesystem, sothatthesizeoftheobservationvector(denoted Ny)is smallerthanthesizeofstatevector(denoted Nψ).Thisisparticularlytrueforgeophysicalsystems,forwhichitisnormal that NyNψ (e.g., y can bethe value anddirectionof theoceancurrent atasingle point inspaceandtime).Because ofthis, we cannot simplyreplace theobserved partsofψn withthe valuesin yn directly, andwehave toconsiderthe conditional pdf p
ψn|yn
.Thedata-assimilationproblemconsistsoffinding thisconditional density,andits fundamental buildingblockisBayestheorem:
p
(ψ
n|
yn) =
p(
yn|ψ
n)
p(ψ
n)
p
(
yn) .
(4)The originalpdf p(ψn)is heretermedthe priorprobability,asit representsourunderstanding ofthesystem priorto as- similatingtheinformationintheobservation.Thelikelihood p(yn|ψn)expressestheprobability ofobserving yn underthe
assumptionthatψn isthetruestateofthesystem.Themarginalprobability p(yn),i.e.,theprobabilityofobserving yn,acts mainlyasanormalizationconstantandensuresthattheresultingposteriorprobabilitydensityisapdf.
2.1. Standardparticlefilter
Thestandardparticlefilterisanensemble-baseddata-assimilationtechniquethatusesadirectevaluationofBayestheo- rem.Eachparticle(equivalenttoan ensemblemember),ψi,isassignedaweight wi that givestherelative importanceof that particleintheensemble.Typically,all Ne particlesare initializedwithweight w0i =1/Ne,astheyare sampledinde- pendentlyfromthepdfoftheinitialconditions,p(ψ0).Eachparticleisthensimulatedindependentlyaccordingto(1) until observationtimetn.Byapplying(4) directlywith(2) asthepriordensity,andbyconsideringthemarginalprobabilityasa normalizationconstant,theposteriordistributionisexpressedas
p
(ψ
n|
yn) ∝
Ne
i=1
p
(
yn| ψ
ni)
Nej=1p
(
yn| ψ
nj) δ(ψ
n− ψ
ni)
=
Ne
i=1
wni
δ(ψ
n− ψ
ni).
(5)
Here,thelikelihoodisusedtoupdatetheweights wni foreachparticle,sothattheposteriorisrepresentedbyaweighted discretedistribution.Wecanevaluatethelikelihoodifweknowthepdffortheobservation.Forinstance,iftheobservation errorisGaussian, n∼N(0,R),theweightforparticleψibecomes
wi
∝
exp−
1 2 yn−
H(ψ
ni)
TR−1
yn
−
H(ψ
ni) .
(6)Assomeparticlesinevitablyendupwithverylowweights,theynolongercarrysignificantstatisticalvalue.Toimprove the statisticalcoverage inthe high-probabilityregions, theensemble isresampled accordingtothe weightdistribution in (6), so that {ψni}i=1,...,Ne ∼p(ψn|yn). All weights for the resampled particles are then reset to 1/Ne. This is known as sequentialimportance resampling.Severalschemescanbe usedforthisresampling [4],andinthisworkwe considerthe residualresamplingscheme [26].Notethatifthemodel(1) hasβ=0,itisimportantthatduplicatedparticlesaregivena perturbationtoavoidensemblecollapseandcompletelyoverlappingparticletrajectories.Withastochasticmodel,however, exactduplicationswillevolvedifferentlythroughindependentrealizationsofβi.
One ofthe main advantages of the standard particle filteris that it preservesall physicalproperties throughout the simulation,asthefinalparticles aregeneratedfromsuccessfulsimulationrunsandnot throughmanipulationofthestate vectors. A drawback, however, is that the ensemble is prone to collapse when the dimension of the observation space increases [3–5].Inhigh-dimensional systems,all particlesendup inthetailofthe likelihood,withtheconsequencethat onlyveryfew particles(perhapsevenjustone)gainamuchhigherweightthanallothers.Thedistributionthencollapses asall Ne particlesareresampledfromfew(orasingle)particlesthathavenon-zeroweights.Thisproblemisoftenreferred toasthecurseofdimensionality.
2.2. Theimplicitequal-weightsparticlefilter
One technique used forovercomingthe curse ofdimensionalityisto sample the statesψni froma proposaldensity,q, withanappropriate compensationintheweights.First,(1) showsthat thepdf ofthestate attimetn isrelatedtothatof theprevioustimebytheMarkovianproperty
p
(ψ
n) =
p
(ψ
n|ψ
n−1)
p(ψ
n−1)
dψ
n−1≈
1 NeNe
i=1
p
(ψ
n|ψ
ni−1),
(7)whereweassumedthatallparticleshavethesameweightattimetn−1.Inthestandardparticlefilter,wedrawtheevolution of theparticlefrom p(ψn|ψni−1), whichisequivalent to solving themodelequation forone time step.We canchoose it differently,byfirstmultiplyinganddividingtheargumentoftheintegralbyaproposaldensityq andthendrawtheparticle evolutionfromthatdensity,
p
(ψ
n) =
1 NeNe
i=1
p
(ψ
n| ψ
ni−1)
qi(ψ
n| ψ
n1−:N1e
,
yn)
qi(ψ
n| ψ
n1−:N1e
,
yn).
(8)We have large freedom in how to choose q, butthe support of q is required to be equal toor larger than the support of p(ψn|ψni−1),and itshould preferably be easy to sample from. Here, theproposal ischosen tobe conditioned on the
observation ynandallparticlestatesattheprevioustimestep,ψn1−:N1e,anditdependsontheparentstateψni−1viaindexi. UsingtheproposaldensityinBayestheorem(4) givesus
p
(ψ
n|
yn) =
1 NeNe
i=1
p
(
yn| ψ
n)
p(ψ
n| ψ
ni−1)
p(
yn)
qi(ψ
n| ψ
n1−:N1e
,
yn)
qi(ψ
n|ψ
n1−:N1e,
yn).
(9)Bynowsamplingψni∼qi(ψn|ψn1−:N1e,yn),theposteriorbecomes
p
(ψ
n|
yn) =
Ne
i=1
wni
δ(ψ
n− ψ
ni),
with wni=
p(
yn| ψ
ni)
p(ψ
ni| ψ
ni−1)
Nep(
yn)
qi(ψ
ni| ψ
n1−:N1e
,
yn) .
(10)Onechoiceofqistheoptimalproposaldensity[27],inwhichqi(ψn|ψn1−:N1
e,yn)=p(ψni|ψni−1,yn).Byconsideringalinear observation operator H andGaussian modeland observationerrors, β∼N(0,Q) and
∼N(0,R),the optimal proposal densityisequivalentto N(ψni,a,P),with
ψ
ni,a=
M(ψ
ni−1) +
Q HTH Q HT
+
R−1dni (11)
and P
=
Q−1
+
HTR−1H −1,
(12)inwhich
dni
:=
yn−
H M(ψ
ni−1)
(13)iscalledtheinnovationforparticlei.Theproposalisoptimalinthesensethatitgivesoptimalvarianceintheweightsfor proposalsoftheformq(ψn|ψni−1,yn),butasitturnsout,itisnotsufficienttoavoidensembledegeneracy [3,5,28].
Themainparticlefilterwewilluseinthisworkisan extensionoftheimplicitequal-weightsparticlefilter(IEWPF).In theIEWPF [29],q ischosensimilarbutnotidenticaltotheimplicitparticlefilter [30] bychoosingthenewparticlesas
ψ
ni= ψ
ni,a+ α
i1/2P1/2ξ
i,
(14) inwhichξi isa drawfromthestandardmultivariateGaussian distributionξi∼N(0,I)andα
i isafunctionofboth ξ and ψin−1.Furthermore,we chooseα
i suchthat the weights ofall particles becomeequal toa target weight,which isequal to the lowest optimalproposalweight of all the particles.This choice is neededto ensure that we keep all particles in theensemble, butcomeswithtwo drawbacks. Firstly,when thenumberofparticles increases, theworst particlewill be located furtherandfurther away fromthe observations,so thescheme enforces all particles tomove further away from theobservations.Secondly,numericalexperimentsshow thatthespreadoftheparticles becomesunderestimated inlow- dimensionalsystems(itsbehavior inhigh-dimensionalsystemsishardertoassessaswedonotknowthetrueanswer).Not withstandingthesenegatives,theIEWPFisthefirstparticlefilterthathasuniformweightsinhigh-dimensionalsystems.Toalleviatethesetwoissues,Skauvoldetal. [31] extendedtheschemebyproposinganupdateequationforeachparticle oftheform:
ψ
ni= ψ
ni,a+ α
i1/2P1/2ξ
i+ β
1/2P1/2ν
i,
(15) inwhichν
i isasecond randomvectorν
i∼N(0,I)andβ isacovariancescaling parametercommontoall particles.The introductionofthenewtermenablesustoremove theunderestimation oftheparticlespreadbytuning β.Furthermore, wecanchooseα
iandβ suchthatthetargetweightisequaltothemeanoftheoptimalproposalweights.Theconsequence ofthischoiceisthattheparticlesarenotforcedawayfromtheobservationswhentheensemblesizeincreases.Withthis, bothproblemsaresolved,andthisnewschemeisthebasisforournumericalexperiments.Detailsoftheschemearegiven inAppendixA.3. Simplifiedoceanmodelformassivelyparallelarchitectures
Traditional ocean circulation models [2,32] aregenerally written to resolve asmany of the physicalprocesses in the ocean aspossible,andtypicallyconsider conservationofmass,momentum,energy, andtracers(salt andtemperature)in threedimensions.Thismakesthemverycomputationallydemandingandlimitsthefeasiblenumberofensemblemembers.
Thenumberofmembersinanoperationalensemblepredictionsystemtodayisusuallybetween10and100.Insteadofafull three-dimensionaloceancirculationmodel,weassumethattheverticalvelocitiesarenegligiblecomparedtothehorizontal movement, andlet thenonlinear shallow-water equations in a rotational domainserve as a simplified model.Thus, we vastlyreducethestatespaceoftheproblem. Inoperationalsettings,thesimplifiedmodelmaybeinitialized basedonthe
mostrecentoceanstatefromatraditionaloceancirculationmodel,andbeusedfortoforecast short-termoceancurrents.
Furthermore,drift ofLagrangian objectsintheocean aretypically drivenby theocean currents,wind,andwave-induced forces(Stokesdrift) [33],whereasinthisworkweonlyconsiderthecontributionfromtheoceancurrents.
Theshallow-waterequationsareintheclassofhyperbolicconservationlaws,whichareoftensolvedusingexplicitfinite- volumemethods [34].Thisclassofproblemsiswell-suitedforefficientimplementationonmassivelyparallelhardware,such asGPUs [35–37].Byalsocarefullytailoringthedata-assimilationalgorithmstouselocaloperations,weareabletorunthe mostcomputationally demandingpartsofthe codeontheGPU. Controlflowandintrinsicserialoperations,however,are still carriedout on the CPU.This way,we useeach processor type forthe taskwhich it isbest suitedfor. Through this approach,wecanefficientlyrunanensembleofasimplifiedoceanmodeloncommodity-leveldesktopcomputers,reducing therequirementsforaccesstosupercomputers.
TheGPUisanextremecaseofamany-coreprocessor,withhundredsorthousandsofsimplecores.Measuredinfloating- point operationsper second(FLOPS), astandarddesktopGPUsurpassestheperformanceofthetop supercomputerinthe world ten years ago [38], and is today roughly ten times asfast as the CPU. GPUs were initially designed for efficient graphics operations, buthave becomeincreasingly popular forgeneral-purpose computingover thelast 15 years.Due to theirdesignforoptimizedthroughputofdata-paralleloperationsandlowpricesdrivenbythegamingmarket,theybecame attractiveacceleratorswhenthesteadilyincreasingCPUclockfrequencycametoanend [39].Programminglanguagessuch as CUDA and OpenCL, and easy access to highly specialized third-party libraries,1 debuggers and profilers, have further contributedtomakethemaccessibleforawiderangeofcomputationalproblems.
TheprogrammingmodeloftheGPUisaccessedthroughkernels,whichareprogramswritteninspecializedlanguagesfor runningontheGPUinaSIMD/SIMT(SingleInstruction,MultipleData/Threads)fashion.Thethreadsareorganizedinblocks, whichagainareorganizedinagrid.Thegrid(andblocks)canbe one-,two- orthree-dimensional,andtheidealchoiceof block-sizeconfiguration,denotedby(bx,by),willvaryfordifferentkernelsandfordifferentGPUs.Eachthreadcancommu- nicate withotherthreadsinthesameblockthroughthesharedmemory,whichcanbedescribedasaprogrammablecache orscratchpadmemory.Communicationbetweenthreadsindifferentblocks,however,requirescostlyglobalsynchronization.
The GPUdoesnot sharethemainCPUmemory,andallrequireddatathereforeneedstobeexplicitlytransferredbetween the GPU andCPU. This operation is relatively expensiveand should be minimized foroptimal performance. For a more thoroughintroductiontoGPUcomputing;see,e.g.,SandersandKandrot [40].
Toachievebothcomputationalperformanceandcodedevelopmentefficiency,wetreatthecomputationalintensivepart of the codeandthe program flow indifferent ways. PyCUDA [41] isa Python package that exposesthecomplete CUDA run-timeAPI andallows ustocallnativeGPUkernelswritteninCUDAdirectlyfromPython.Thisway,one canwritethe program flow,aswellaspre- andpost-processingofthespecific applications,inhigh-level Python,andatthesametime ensure thatthecomputationallyexpensivesimulationlooprunsasefficientaspossiblethroughlow-levelCUDAC/C++.By takingadvantageofwidelyavailable andpopularpackages–includingNumPy [42] andmatplotlib [43],andenvironments suchastheJupyterNotebook [44] –thecodeandexperimentscanbedevelopedefficientlythroughrapidprototyping.
Intheremainderofthissectionwegive anoverviewofthemodelandthemodelerrors,andshowhowweutilizethe GPUtoincreasecomputationalefficiency.
3.1. Thesimplifiedoceanmodel
Theshallow-waterequationsconsiderthreeconservedvariables;theelevation
η
ofthefreeoceansurfacerelativetoits equilibriumlevel,andthevolumetransporthuandhv alongtheabscissaandordinate,respectively.Theequilibriumdepth isgivenbyHeq andishereassumedtobeconstant,sothatthefullheightofthewatercolumnbecomesh=Heq+η
.With gravitationalaccelerationgandCoriolisparameter f,theshallow-waterequationscanbewritten( η )
t+ (
hu)
x+ (
hv)
y=
0, (
hu)
t+
hu2+
12gh2
x
+ (
huv)
y=
f hv, (
hv)
t+ (
huv)
x+
hv2+
12gh2
y
= −
f hu.
(16)
Theequationsrepresentahyperbolicconservationlaw,andcanbewritteninvectorformas
ψ
t+
F(ψ)
x+
G(ψ)
y=
Sf(ψ ),
(17)forastatevectorψ= [
η
,hu,hv]T.Here,FandGarefluxtermsalongtheabsiccaandordinate,respectively,andSf consists ofthesourcetermsduetotheCoriolisforces.The modeloperator M(ψ) will be the numericalscheme that solves (16) and evolvesthe state forward in time. We usethehigh-resolution central-upwindschemeproposedby Chertocketal. [45],butwithareformulationthat avoidsthe
1 BLAS,RNG,FFT,imageandsignalprocessing,collectivecommunicationprimitives,graphanalytics,etc.
expensiverecursiveformulationofCoriolispotential terms [46].The schemeisdesignedtobe well-balancedwithrespect tothegeostrophicbalance,
hu
= −
g Heq f∂ η
∂
y and hv=
g Heq f∂ η
∂
x,
(18)which permitsrotatingsteady-state solutionsby balancing thegravitationalandCoriolis forces.The numericalschemeis solvedonaCartesiangridM consistingofNM=nx×nycells.Thesizeofeachcellisx×y,sothatthecellwithindex (j,k),containingthevalueψj,k,isthecellcenteredat
(
xj,
yk) =
j+
12x
,
k+
12y
.
(19)The total size ofthe state vector ψ then becomes Nψ =3NM. The time integrationis solved by a second-order strong- stability-preserving Runge-Kuttamethod, andthe storagerequirement forthe schemeis therefore 2Nψ, asthe full state mustbestoredfortwoconsecutivetimesteps.
ThestepsizeofthenumericalschemeislimitedbytheCFLcondition,
tscheme
≤
1 4minx maxMu
±
g
(
Heq+ η ) ,
y maxMv±
g
(
Heq+ η )
,
(20)in which the dominating term is the speed of gravitational waves,
g(Heq+
η
). Even though such waves occur in the ocean,perhapsmostnotablethroughtides,theircontributiontodriftermotionislimited.Eddiesandotherrotation-driven dynamicsaremuchmoreimportant,buttheyoperateonlongertimescales.Nevertheless,theCFL-conditionin(20) mustbe satisfiedtoensurenumericalstability.Torunthedata-assimilationmodelonarelevanttimescale,wedecouplethemodel operatorM fromthetimestepofthenumericalscheme,andletthefixedmodeltimestept consistofasmanytscheme stepsasnecessary.Weevaluatetheconditionin(20) continuouslytoadapttscheme tothemostrecentmodelstate,using aCourantnumberof0.8.3.2. Smallscalemodelerrors
Toaccountforerrorsinourmodel(e.g.,missingphysics),weintroducesmall-scaleperturbationsthroughthestochastic variable,β= [δ
η
,δhu,δhv]T,sothatβ isapproximatelydrawnfromN(0,Q).Thismodelerrorisgeneratedbysamplinga randomvectorξ∼N(0,I)andapplyingacovarianceoperator,β =
Q1/2ξ.
(21)Thiserrorisaddedtothemodelstateaftereachmodeltimestept.Wedesignthecovarianceoperatorbasedontwore- quirements.First,sinceweaimtoimplementallcomponentsinthedata-assimilationsystemtorunefficientlyonmassively parallelarchitectures,wedesignthecovarianceoperator Q1/2 intermsoflocaloperations.Second,itisimportantthatthe stochasticmodelerrordoesnotintroducediscontinuitiesornon-physicalmodelstatestothesolution.
Tomaketheperturbationoftheoceansurfaceδ
η
sufficientlysmooth,itisgeneratedaccordingtoasecond-orderauto- regressive(SOAR)functiongivenbyδ η
j,k=
nx
a=1 ny
b=1 QSOAR1/2
j,k
,
a,bξ
a,b,
(22)inwhich
QSOAR1/2
(
j,k,
a,b) =
q01
+
dist(
j,k,
a,b)
L0 exp−
dist(
j,k,
a,b)
L0
.
(23)Here,q0 isascalingparameterfortheamplitudeofδ
η
,L0isameasureofthecorrelationlengthscale,anddist(j,k,a,b) istheEuclideandistancebetweenthecenterofthecellswithindices(j,k)and(a,b).Sincethecovariancebetweenpoints thatare farfromeachother relativeto L0 becomeszero,thecomputationalwork canbelimitedtooperateonlocaldata pointsonly,andthissatisfiesthefirstdesignrequirement.Equation(22) canthenbewrittenasδ η
j,k=
j+
cSOAR a=j−cSOARk+
cSOAR b=k−cSOARQSOAR1/2
j,k
,
a,bξ
a,b,
(24)in which cSOAR is our cut-off value, tuned so that there are no contribution to δ
η
j,k from a distance larger than cSOARmin(x,y)fromcellj,k.Operationssuchas(24) areverywellsuitedforimplementationontheGPU.A drawback to the expression in (24) is that the computational work and data dependency of the stencil is tightly connectedtotheratiobetweenL0 andthecellsize.Tohavebettercontrolofthisworkload,weintroduceacoarserandom
Fig. 1.Alignmentofnestedgridswithc=3.ThegridM containscellsandisusedforevolvingthenumericalmodel,whereasthegridR contains pointvaluesandisusedforapplyingtheSOARfunctiononsampledrandomnumbersfromN(0,I).Forbestpossibleassimilationofobservations,anoffset canbeappliedtoR sothatoneofitsgridpointsisco-locatedwiththecellinMinwhichtheobservationwasmade.
numbergridR,onwhichthestandard normaldistributedrandom numbersξ are sampled,andapply theSOARfunction here. We choose the discretization of R so that we obtain a good trade-off betweencomputational efficiencyof (24), while maintaining a good spreadof informationwithin the correlated areas.The coarse grid will have grid cells ofsize (˜x,˜y)=c(x,y),wherec isan oddnumberrepresentingthecoarseness ofR.Valueson R areinterpretedas pointvalues,andwedenotethenumberofgridpointsinR byNR.Byrequiringthatcisodd,weensurethatthepoint values definedon R areco-located withcellcenters ofM,asshownin Fig.1.Furthermore,we choosethecoarsening factorcsothatthecut-offfactorin(24) canbechosenascSOAR=2.Afterhavingobtainedδ
η
onR through(24),weuse bicubicinterpolation,denotedbytheoperatorI,toobtaincell-averagedvaluesonM.Toavoidthat theperturbation β produces non-physicalmodelstates(the seconddesign requirement),weuse(18) to ensurethatβisingeostrophicbalance.Bydiscretizing(18) withcentraldifferencesontheM grid,δhuandδhvarefound fromδ
η
byδ
huj,k= −
g Heq fδ η
j,k+1− δ η
j,k−12
y and
δ
hvj,k=
g Heq fδ η
j+1,k− δ η
j−1,k2
x
.
(25)Thisoperationisdenotedby QG B1/2.Itshouldbenotedthatthederivativesofδ
η
areapproximatedby(25),eventhoughthey are analyticallyavailabledirectlyfromthebicubicinterpolation.Thereasonisthatgeostrophic balanceisonlymaintained by the numericalscheme withrespect to the grid resolution. The bicubic surface, however,is continuously definedand will typically containoscillations on sub-grid scale,meaning that thederivatives ofthebicubic surface oftenwillnot be represented bythediscrete valuesonthegrid.The centraldifferencesin(25) arethereforebetter suitedforgeneratinga modelstatethatisinbalanceunderthenumericalscheme.Evaluatingthecompletemodelerrornowconsistsoffouroperations,
β =
Q1/2ξ =
QG B1/2IQSOAR1/2ξ,
(26) inwhichthefirststepistosampleξ∼N(0,I).Notethat QG B1/2 andQSOAR1/2 are linearoperators,whereas I isanonlinear stencil.TheinputandoutputforeachoftheoperationsareQSOAR1/2
:
R→
R,
I:
R→
M,
QG B1/2:
M→
3×
M,
(27)
makingthecovarianceoperatoractas
Q1/2
:
R→
3×
M.
(28)TheseoperationsareillustratedinFig.2.First,therandomfieldξ issampledonthecoarsegridR,andtheSOARoperator QSOAR1/2 isappliedtogenerateacoarsecorrelatedfield.Then,thecorrelatedfieldisinterpolatedontothecomputationalgrid M using I,andδhuandδhvarecomputedtobeingeostrophicbalancewithrespecttoδ
η
.It should be noted that our choice of the model error leads to a non-symmetric square root Q1/2, and that this implementation-oriented definitionof Q1/2 makes useofsignificantly fewer randomnumbersthan variablesin thestate vector. Using c=3,illustrated inFig.1,asan example,wesample one randomnumberforevery nine
η
variables,and none forhu andhv, since δhu andδhv are computedfrom(25). This resultsin one randomnumber forevery 27state variables.Itshouldfinallybe notedthatbecause Q1/2 isanonlinearoperatorduetothebicubicinterpolation,theβ’sare notstrictlyGaussiandistributed.This,however,isnotaproblemasthecovarianceoftheβ’sisstillsymmetricpositivesemi definite.Fig. 2.Thesmallscalemodelperturbationβ= [δη,δhu,δhv]T isgeneratedbyfirstsamplingrandomnumbersfromastandardnormaldistributionξ∼ N(0,I)onthecoarsegridR.WethengivetherandomfieldacovariancestructureaccordingtotheSOARfunctionQ1SOAR/2,beforeinterpolatingthecoarse randomfieldontothefinemodelgridM throughItogetδη.Finally,wecalculatethecorrespondingmomentumδhuandδhvtoimposegeostrophic balance.
3.3. Efficientimplementationofmodelerrors
The SOAR function, bicubic interpolation, and geostrophic balance are all local stencil operations that are simple to parallelize, as each element of their output can be found independently from all other output elements. Generation of randomnumbersξ canfurtherbedonethroughthecuRANDlibraryavailablethroughtheCUDAtoolkit.Thesamplingofβ isthereforewell-suitedforimplementationontheGPU.
TheSOAR functionin(24) with cSOAR=2 consistsofastencil operationdepending on5×5 inputvaluescenteredon thetargetcell.WeuseoneGPUthreadperoutputelement.Tominimizetheamountofdatareadfromglobalmemory,all threadswithinthesameblockcooperatetoreadthecollectivelyrequiredinputdataintosharedmemory.
Inthebicubicinterpolation I,eachvalueinthefinegridM dependsonthe4×4 points inthecoarsegridR that surroundsitsposition.Thismeansthatthec×coutputvaluesthatarelocatedbetweenthesamefourcoarsegridpoints haveoverlappingdatadependencies.WestillapplyoneGPUthreadperoutputelement,andobtaingeostrophicallybalanced δhuandδhvwithinthesamekernel.Eachblockcomputes(bx+2)×(by+2)valuesofδ
η
andstoresthemtemporarilyin sharedmemory,sothatbx×byvaluesofδhuandδhv efficientlycanbecomputedusing(25).Thememoryfootprintofobtainingβ istwobuffersofsize NR,holdingξ andtheresultfrom QSOAR1/2 ξ,respectively.The memoryfootprintoftherandomnumbergeneratorcomesinadditiontothis.Notethatweneverstoreβitself,butaddit directlyintothestatevectorψ.
3.4. Synthetictruthandobservations
The experiments in thispaper are so-calledidentical twinexperiments, meaning that the same modelequations are usedtogeneratethesynthetictruestateandtoevolvetheensemble.The truestate ψtrue isgeneratedfromaknownset ofinitial conditionsby runningthe numericalschemewithstochastic modelerrors asdescribed above.Furthermore, ND Lagrangian drifters(driftingbuoys)are simulatedtobe advected passivelyalong theoceancurrentaccordingto asimple forwardEulerintegrationscheme.