journal homepage: www.intl.elsevierhealth.com/journals/cmpb
Bayesian network modeling: A case study of an epidemiologic system analysis
of cardiovascular risk
P. Fuster-Parra a,b,∗, P. Tauler b, M. Bennasar-Veny b, A. Ligęza c, A.A. López-González d, A. Aguiló b
a Department of Mathematics and Computer Science, Universitat Illes Balears, Palma de Mallorca, Baleares E-07122, Spain
b Research Group on Evidence, Lifestyles & Health, Research Institute on Health Sciences (IUNICS), Universitat Illes Balears, Palma de Mallorca, Baleares E-07122, Spain
c Department of Applied Computer Science, AGH University of Science and Technology, Kraków PL-30-059, Poland
d Prevention of Occupational Risks in Health Services, GESMA, Balearic Islands Health Service, Hospital de Manacor, Manacor, Baleares E-07500, Spain
Article info
Article history:
Received 18 August 2015
Received in revised form 28 November 2015
Accepted 11 December 2015
Keywords: Bayesian networks; Model averaging; Cardiovascular lost years; Cardiovascular risk score; Metabolic syndrome; Causal dependency discovery
Abstract
An extensive, in-depth study of cardiovascular risk factors (CVRF) seems to be of crucial importance in research on cardiovascular disease (CVD) in order to prevent (or reduce) the chance of developing or dying from CVD. The main focus of data analysis is on the use of models able to discover and understand the relationships between different CVRF. This paper reports on applying Bayesian network (BN) modeling to discover the relationships among thirteen relevant epidemiological features of the heart age domain in order to analyze cardiovascular lost years (CVLY), cardiovascular risk score (CVRS), and metabolic syndrome (MetS). Furthermore, the induced BN was used to make inferences taking into account three reasoning patterns: causal reasoning, evidential reasoning, and intercausal reasoning. Application of BN tools has led to the discovery of several direct and indirect relationships between different CVRF. The BN analysis showed several interesting results, among them: CVLY was highly influenced by smoking, with men being the group at highest risk in CVLY; MetS was highly influenced by physical activity (PA), with men again being the group at highest risk in MetS, and smoking did not show any influence. BNs produce an intuitive, transparent, graphical representation of the relationships between different CVRF. The ability of BNs to predict new scenarios when hypothetical information is introduced makes BN modeling an Artificial Intelligence (AI) tool of special interest in epidemiological studies. As CVD is multifactorial, the use of BNs seems to be an adequate modeling tool.
© 2015 Elsevier Ireland Ltd. All rights reserved.
∗ Corresponding author at: Department of Mathematics and Computer Science, Universitat Illes Balears, Palma de Mallorca, Baleares E-07122, Spain. Tel.: +34 971171386.
E-mail address: [email protected] (P. Fuster-Parra).
http://dx.doi.org/10.1016/j.cmpb.2015.12.010
0169-2607/© 2015 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Bayesian Networks (BNs) [1,2], also referred to as Belief Networks or probabilistic causal networks, are an established framework for uncertainty management in Artificial Intelligence (AI). They constitute a tool which combines graph theory and probability theory to represent relationships between variables (nodes in the graph) [3]. Contrary to a deterministic understanding of the causality phenomenon [4], BN modeling has its origins within data mining and machine learning research [5,6] and captures probabilistic influences induced from big data sets. BNs constitute a powerful knowledge representation and an efficient reasoning tool under conditions of uncertainty [7]. The network structure is a directed acyclic graph (DAG) where each node represents a random variable [8,9] and the arcs are suitable for representing causality [10].
BNs have been proven to be a strong tool for discovering the relationships between variables: they attempt to separate out direct and indirect dependencies [11,12], and can capture the way an expert understands the relationships among all the features [13]. BN modeling is widely used in fields like clinical decision support [14], systems biology [15,16], human immunodeficiency virus (HIV) and influenza research [17,18], analyses of complex disease systems [19–21], interactions between multiple diseases [22], and also in disease diagnosis [23–27].
The metabolic syndrome is a set of risk factors that includes abdominal obesity, insulin resistance, dyslipidemia and hypertension, leading to an increased risk of developing cardiovascular diseases and type 2 diabetes [28–31]. Cardiovascular disease (CVD) is a worldwide public health problem [32]. The economic burden of CVD is already affecting the economies of the world's wealthiest countries. However, in the next decades developing countries will be more affected due to the great increase in CVD prevalence expected in these countries [33]. It is estimated that in 2015 more than 20 million people may die worldwide because of CVD. This number is expected to increase in the upcoming decades, to the point that a myocardial infarction would occur somewhere in the world every 5 s [34,35].
CVDs are closely related to the well-known cardiovascular risk factors (CVRF). The concept of CVRF appeared in 1961, when the group of Kannel defined CVRF as biological traits or behaviors that increase the chance of developing or dying from CVD [36,37]. The high prevalence of certain risk factors to which we are exposed is the cause of this situation, in which the prevalence of CVD increases every year. It is necessary to control the factors that influence the development of CVD, such as smoking, hyperlipidemia, hypertension, diabetes, obesity, a diet high in saturated fats, alcohol abuse, a sedentary lifestyle, and stress [38]. In fact, the WHO (World Health Organization) estimates that 80% of premature deaths from cardiovascular disease and diabetes could be prevented by efficiently controlling these risk factors [39].
There are some scores that numerically quantify cardiovascular risk (CVR). One of the most widely used is the Framingham score, with its calibrated form for the Spanish population, the Framingham-REGICOR [35]. This scale estimates the global CVR at 10 years and is expressed as a percentage. Recently, a new score has been proposed, the so-called Heart Age tool (HA), which is based on the Framingham score and offers a simple and graphic way to communicate the CVR because it expresses the CVR as an age. If the HA value is older than the chronological age, the term "lost years", defined as the HA minus the chronological age, can be used. The HA is a novel concept designed specifically to help people understand their own cardiovascular disease risk and implement changes in their lifestyles to prevent the incidence of CVD [40].
The development and analysis of models to examine the relationships between different CVRF is not only of theoretical interest, but can also serve as a generic tool for application-oriented activities: explanation, prediction, monitoring and prevention. Such modeling enables theoretical analysis of the relationships between numerous variables and, bearing in mind the probabilistic nature of the causal dependencies, BNs seem to be an adequate tool. Moreover, BN models are capable of creating different scenarios based on hypothetical cases when new observations are instantiated.
The paper is organized as follows. Section 2 introduces BNs and some basic concepts for inference flow. Section 3 presents the materials and methods for the epidemiologic study and the process of inducing a BN from a data set. Section 4 shows different reasoning patterns to analyze the BN. Section 5 presents a discussion. Finally, Section 6 concludes the paper.
2. Bayesian networks
A BN consists of [41]: (i) a set of variables and a set of directed edges between these variables, where (ii) each variable has a finite set of mutually exclusive states, and (iii) the variables together with the directed edges form a DAG. BN models estimate the joint probability distribution P over a vector of random variables X = (X1, ..., Xn). The joint probability distribution factorizes as a product of several conditional distributions, whose dependency/independency structure is denoted by a DAG G:
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\left(X_i \mid \mathrm{Pa}(X_i^{G})\right) \qquad (1)$$
Eq. (1), where $\mathrm{Pa}(X_i^{G})$ denotes the parent nodes of Xi in G, is the main reason for the formulation of a multivariate distribution by BNs; this equation is also called the chain rule for Bayesian networks.
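As an illustration of the chain rule, the following minimal Python sketch multiplies one factor per node for a three-node fragment in which Smoking has parents Gender and Age. The two conditional rows are copied from Table 2; the marginals for Gender and Age are partly assumed (only Men = 44.0% and Age 55–64 = 14.6% are quoted in the text), and treating Gender and Age as parentless is an assumption made for the example, not a statement about the learned structure.

```python
# Illustrative fragment: Gender -> Smoking <- Age.
# ASSUMPTION: Gender and Age have no parents; the age split below is partly invented.
P_GENDER = {"Men": 0.44, "Women": 0.56}
P_AGE = {"35-44": 0.50, "45-54": 0.354, "55-64": 0.146}

# P(Smoking | Gender, Age): two rows copied from Table 2.
P_SMOKING = {
    ("Men", "35-44"): {"Former": 0.0668, "Current": 0.3636, "Never": 0.5695},
    ("Women", "55-64"): {"Former": 0.1348, "Current": 0.1311, "Never": 0.7341},
}


def joint(gender, age, smoking):
    """Chain rule: product of P(node | parents) over all nodes."""
    return P_GENDER[gender] * P_AGE[age] * P_SMOKING[(gender, age)][smoking]


if __name__ == "__main__":
    print(joint("Men", "35-44", "Current"))
```

Each factor only needs the node's own parents, which is why 13 small conditional tables suffice to specify the full joint distribution of the 13 features.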
As BNs are used to make inferences [8], it is necessary to understand the flow of influence when new information is introduced in a BN. Below we introduce some basic concepts.
Two variables X and Y in a BN are d-separated if, for every possible path between X and Y, there is an intermediate variable Z such that either: (i) the connection is serial (X → Z → Y or X ← Z ← Y) or diverging (X ← Z → Y) and Z is instantiated, or (ii) the connection is converging (X → Z ← Y) and neither Z nor any of Z's descendants have received evidence. When influence flows from a node X to another node Y via a node Z, it is said that the trail X-Z-Y is active. A causal trail X → Z → Y (serial connection), an evidential trail X ← Z ← Y (serial connection), or a common cause trail X ← Z → Y (diverging connection) is active if and only if Z is not observed. A common effect trail X → Z ← Y (converging connection) is active if and only if either Z or one of Z's descendants is observed.
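The four activation rules for a three-node trail can be written down directly. The sketch below is only a restatement of those rules in Python; the function name and the string labels for the trail kinds are hypothetical.

```python
def trail_active(kind, z_observed, z_descendant_observed=False):
    """Return True if influence can flow from X to Y through Z.

    kind: "causal" (X -> Z -> Y), "evidential" (X <- Z <- Y),
          "common_cause" (X <- Z -> Y) or "common_effect" (X -> Z <- Y).
    """
    if kind in ("causal", "evidential", "common_cause"):
        # Serial and diverging connections are blocked by observing Z.
        return not z_observed
    if kind == "common_effect":
        # Converging connections open only when Z or a descendant is observed.
        return z_observed or z_descendant_observed
    raise ValueError(f"unknown trail kind: {kind}")


if __name__ == "__main__":
    print(trail_active("common_cause", z_observed=True))
    print(trail_active("common_effect", z_observed=False, z_descendant_observed=True))
```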
Let P be a joint probability distribution of the random variables in some set of features F, let A denote the set of arcs, and let G = (F, A) be a DAG; then (G, P) satisfies the local Markov condition if each variable (feature) X ∈ F is conditionally independent of the set of all its non-descendants given the set of all its parents. The global Markov property states that any node X is conditionally independent of any other node given its Markov blanket, i.e., I(X, non-Markov-blanket(X) | Markov-blanket(X)); the Markov blanket of a node includes its parents, its children, and the children's other parents (spouses). Any node in the BN is d-separated from the nodes outside its Markov blanket given its Markov blanket.
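The Markov blanket can be read off an arc list by collecting parents, children, and the children's other parents. The following sketch assumes the DAG is given as a list of (parent, child) pairs; the node names in the usage example are arbitrary letters and do not correspond to the study's network.

```python
def markov_blanket(node, edges):
    """Parents, children and spouses of `node` in a DAG given as (parent, child) arcs."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    spouses = {u for u, v in edges if v in children and u != node}
    return parents | children | spouses


if __name__ == "__main__":
    # Hypothetical toy DAG, for illustration only.
    toy_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "D"), ("D", "F")]
    print(sorted(markov_blanket("C", toy_edges)))
```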
3. Data and methodological issues
This section presents some methodological issues concerning data acquisition. Reliability of the data was assured by standard medical procedures. A brief description follows.
3.1. Participants
All participants were workers from the public sector of the Balearic Islands (Spain). Subjects in the study were invited to participate during their annual work health assessment. Any worker attending the work health assessment could be included in the study. In total, 4300 workers were invited to participate. Among them, 3993 subjects (Men = 1758, Women = 2235) agreed to participate. Participants signed informed consent prior to enrollment. After acceptance, a complete family and personal medical history was recorded. The study protocol was in accordance with the Declaration of Helsinki and received approval from the Balearic Islands Clinical Research Ethical Committee.
3.2. Instruments
3.2.1. Determining variables
All anthropometric measurements were made in the morning after an overnight fast, according to the recommendations of the International Standards for Anthropometric Assessment [42]. Body weight (electronic scale Seca 700; Seca, Hamburg, Germany), height (stadiometer Seca 220 cm), and abdominal waist circumference (Lufkin Executive Thinline W606, precision 1 mm) were determined according to the recommended techniques mentioned above. Body mass index (BMI) was calculated as weight (kg) divided by height (m) squared. BMI values were categorized following the criteria from the WHO [39].
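The BMI computation itself is a one-line formula. The sketch below takes height in centimeters, as a stadiometer would report it, and converts to meters; the function name is hypothetical and the category cut-offs are deliberately left out because they are not stated in the text.

```python
def body_mass_index(weight_kg, height_cm):
    """BMI = weight (kg) / height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


if __name__ == "__main__":
    print(round(body_mass_index(70.0, 175.0), 1))
```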
Blood samples were collected during the same session and in the same place after an overnight fast of 12 h. Serum was obtained and total cholesterol, HDL cholesterol, glucose, and triglycerides were measured using an automated analyzer (Technicon DAX system). Blood pressure was measured with a calibrated automatic sphygmomanometer (Omron M3). Measurements were repeated three times with a pause of 1 min between measurements and the average value was recorded. To assess physical activity practice, the self-reported number of sessions of physical activity per week was obtained.
3.2.2. Determining cardiovascular risk variables
The presence of metabolic syndrome (MetS) was ascertained using the criterion suggested by the National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III). The Framingham equation calibrated for the Spanish population (Framingham-REGICOR) was used to determine the cardiovascular risk at 10 years (software tool calcumed plus, available at http://www.fisterra.com). Participants were classified according to cardiovascular disease (CVD) risk following the Framingham-REGICOR guidelines: >10% High risk CVD, 5–9.9% Moderate risk CVD, <5% Low risk CVD [43].
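The three risk classes amount to a simple thresholding of the 10-year risk percentage. The sketch below is one possible reading of the published bands; how values between 9.9% and 10%, and exactly 10%, are handled is an assumption, since the stated bands leave them unspecified.

```python
def cvd_risk_class(risk_percent):
    """Map a 10-year Framingham-REGICOR risk (in %) to Low / Moderate / High.

    ASSUMPTION: values in [5, 10) are Moderate and values >= 10 are High;
    the text only gives "<5%", "5-9.9%" and ">10%".
    """
    if risk_percent < 5.0:
        return "Low"
    if risk_percent < 10.0:
        return "Moderate"
    return "High"


if __name__ == "__main__":
    for risk in (3.2, 7.5, 12.0):
        print(risk, cvd_risk_class(risk))
```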
The heart age was calculated using the Heart Age Calculator, available at http://www.heartage.me. Cardiovascular lost years (CVLY) is defined as the difference between the heart age and the chronological age [44]. CVLY takes the values: First Quartile [−20, −4], Second Quartile [−3, 3], Third Quartile [4, 12], and Fourth Quartile [13, 20].
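The discretization of CVLY can be sketched as follows, assuming both ages are whole years so that the four closed intervals cover every integer from −20 to 20. Returning None for values outside that range is a choice made for this example only.

```python
# Quartile bounds as given in the text (inclusive, in years).
CVLY_BINS = [
    ("First Quartile", -20, -4),
    ("Second Quartile", -3, 3),
    ("Third Quartile", 4, 12),
    ("Fourth Quartile", 13, 20),
]


def cvly_quartile(heart_age, chronological_age):
    """CVLY = heart age - chronological age, mapped to its quartile label."""
    lost_years = heart_age - chronological_age
    for label, low, high in CVLY_BINS:
        if low <= lost_years <= high:
            return label
    return None  # outside the published ranges (handling here is an assumption)


if __name__ == "__main__":
    print(cvly_quartile(60, 50))
```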
With slight differences between them, the parameters required for calculating the Framingham-REGICOR score and the heart age are: age, sex, height (in centimeters), weight (in kilograms), waist circumference (in centimeters), family history of cardiovascular diseases, the presence or absence of diabetes, smoking habit, total cholesterol and HDL-cholesterol levels, and systolic pressure or antihypertensive treatment [45].
3.3. Learning Bayesian networks
To obtain a BN, it is necessary to determine a structure (defined by a DAG) and the conditional probabilities assigned to each node of the DAG. Therefore, learning a BN implies two tasks: (i) structural learning, that is, the identification of the topology of the BN, and (ii) parametric learning, that is, the estimation of numerical parameters (conditional probabilities) given a network topology.
3.3.1. Structural learning
The problem of discovering the causal structure becomes harder as the number of variables increases [46–48]. Table 1 shows a description of the variables considered.
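We are interested in obtaining a DAG, so only three possible connections are considered between each pair of nodes (an arc in either direction, or no arc). The number of different structures, f(n), grows more than exponentially in the number of nodes n; in [49] the following efficiently computable recursive function is given, with f(0) = 1:
$$f(n) = \sum_{i=1}^{n} (-1)^{i+1} \frac{n!}{(n-i)!\, i!}\, 2^{i(n-i)} f(n-i) \qquad (2)$$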
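To give a feel for how fast the number of structures grows, the recurrence can be evaluated directly. The sketch below uses its standard form, in which the sum runs over i = 1..n with the binomial coefficient n!/((n−i)! i!), the factor 2^(i(n−i)), the recursive term f(n−i), and the base case f(0) = 1; memoization keeps the evaluation cheap.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labelled DAGs on n nodes via the recurrence of Eq. (2)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * num_dags(n - i)
        for i in range(1, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 13):
        print(n, num_dags(n))
```

With 13 features the count is far beyond what exhaustive enumeration could handle, which motivates heuristic search over structures.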
Table 1 – Description of the 13 data set features used to learn the structure.
| Variable name | Description | Values |
|---|---|---|
| Gender | Male and Female | Men, Women |
| Age | Age in years | 35–44, 45–54, 55–64 |
| Smoking | Never smoker, Former smoker and Current smoker | Never smoker, Former smoker, Current smoker |
| PA | Physical activity (three or more times/week during 1 h) | No practice, Practice |
| BMI | Body mass index (kg/m2) | Underweight, Normal weight, Overweight GI, Overweight GII, Obesity TI, Obesity TII, Obesity TIII |
| WC | Waist circumference (cm) | High, Normal, Very high |
| BP | Blood pressure (mmHg) | Normal, Optimal, Normal high, Mild, Moderate, Serious |
| HDL | HDL-cholesterol (mg/dl) | Normal, Low, High |
| CVLY | Cardiovascular lost years | First quartile, Second quartile, Third quartile, Fourth quartile |
| Glucose | Fasting blood glucose (mg/dl) | High, Normal |
| TG | Triglycerides (mg/dl) | Normal, Limit, Hyper |
| CVRS | Framingham-REGICOR score | Low, Moderate, High |
| MetS | Metabolic syndrome | Yes, No |
There are two basic approaches to structure learning [50]: (i) search-and-score structure learning, and (ii) constraint-based structure learning; the combination of both gives a hybrid learning framework. Search-and-score algorithms assign a number (score) to each BN structure, and then the structure with the highest score is chosen. Constraint-based algorithms perform a set of conditional independence tests on the data [51]. Using these tests, an undirected graph can be generated. Taking into account additional independence tests, the network is transformed into a BN. Hybrid algorithms combine aspects of both constraint-based and score-based algorithms: they use conditional independence tests to reduce the search space and network scores to find the optimal network in the reduced space.
In order to obtain the DAG, we used the bnlearn package [52,53] of the R language [54]. As there are many structures that are consistent with the same set of independencies, prior knowledge of the system under study was taken into account in the model selection process. We chose a structure that reflects the causal order and dependencies, that is, one in which causes are parents of their effects; such structures tend to work well [1], and causal graphs tend to be sparser. Causality lies in the world, not in the inference process.
Fig. 1 – Structure obtained by model averaging over 500 networks. It was built with the hill-climbing learning algorithm hc from the bnlearn package in the R language using a threshold = 0.85. In the model selection process we included prior knowledge, thus variables were divided into four blocks: (1) background variables = {Gender, Age}, (2) conditional variables = {Smoking, PA}, (3) intermediate variables = {BMI, TG, WC, HDL, BP, Glucose}, and (4) diagnostic variables = {CVLY, CVRS, MetS}.
Table 2 – Expected values of probabilities for the Smoking feature conditional on combinations of its parent values, in this case conditional on the Gender and Age features.
| Gender | Age | Smoking = Former | Smoking = Current | Smoking = Never |
|---|---|---|---|---|
| Men | 35–44 | 0.0668 | 0.3636 | 0.5695 |
| Men | 45–54 | 0.0845 | 0.3825 | 0.5329 |
| Men | 55–64 | 0.1122 | 0.2852 | 0.6026 |
| Women | 35–44 | 0.1139 | 0.3231 | 0.5630 |
| Women | 45–54 | 0.1415 | 0.3371 | 0.5206 |
| Women | 55–64 | 0.1348 | 0.1311 | 0.7341 |
We included our prior knowledge of the system under study in the model selection process; thus, variables were divided into four blocks: (1) background variables = {Gender, Age}, (2) conditional variables = {Smoking, PA}, (3) intermediate variables = {BMI, TG, WC, HDL, BP, Glucose}, and (4) diagnostic variables = {CVLY, CVRS, MetS}. We restricted the model selection process by blacklisting arrows that point from a later to an earlier block [55]. To obtain the structure, there are two options: either select a single best model or obtain an averaged model, which is known as model averaging [56]. Our model was learnt with the hill-climbing (hc) algorithm. The final model was obtained by repeating structure learning many times: a large number of network structures (500 BNs) were explored to reduce the impact of locally optimal (but globally suboptimal) networks on learning. The learned networks were averaged to obtain a more robust model. The averaged network structure was obtained using the arcs present in at least 85% of the networks, which gives a measure of the strength of each arc and establishes its significance given a threshold (85%) (see Fig. 1).
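The two ingredients of this step, the block-based blacklist and the arc-frequency threshold, can be illustrated independently of any particular learning library. The Python sketch below is not the bnlearn workflow used in the study; it only shows, under that caveat, how the blacklist follows from the four blocks and how arcs would be kept when they occur in at least 85% of a collection of learned networks. The function names and the toy networks in the usage example are hypothetical.

```python
from collections import Counter
from itertools import product

# The four blocks described in the text, ordered from earliest to latest.
BLOCKS = [
    ["Gender", "Age"],
    ["Smoking", "PA"],
    ["BMI", "TG", "WC", "HDL", "BP", "Glucose"],
    ["CVLY", "CVRS", "MetS"],
]


def block_blacklist(blocks):
    """All arcs (source, target) pointing from a later block to an earlier one."""
    banned = []
    for later in range(len(blocks)):
        for earlier in range(later):
            banned.extend(product(blocks[later], blocks[earlier]))
    return banned


def average_arcs(networks, threshold=0.85):
    """Keep arcs present in at least `threshold` of the networks.

    networks: list of arc lists, one per learned structure.
    """
    counts = Counter(arc for arcs in networks for arc in set(arcs))
    return {arc for arc, c in counts.items() if c / len(networks) >= threshold}


if __name__ == "__main__":
    print(len(block_blacklist(BLOCKS)), "blacklisted arcs")
    # Hypothetical learned structures, for illustration only.
    toy = [[("A", "B")]] * 9 + [[("A", "B"), ("B", "C")]]
    print(average_arcs(toy))
```

Arcs within a block and arcs from an earlier to a later block remain admissible; only the direction that would contradict the assumed causal ordering is excluded.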
3.3.2. Parametric learning
Parameters were obtained, again with the bnlearn package in the R language, by performing a Bayesian parameter estimation using the Dirichlet distribution [57].
A conditional probability distribution is obtained for each node. Table 2 shows an example of a conditional probability distribution.
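For a single parent configuration, Bayesian estimation with a Dirichlet prior reduces to adding pseudo-counts to the observed counts and normalizing. The sketch below shows that posterior mean under a symmetric prior; the prior weight alpha and the counts in the usage example are hypothetical, and the exact prior used in the study is not stated in the text.

```python
def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean of a categorical distribution under a symmetric Dirichlet prior.

    counts: dict state -> observed count for one parent configuration.
    alpha:  pseudo-count added to every state (ASSUMED value, for illustration).
    """
    total = sum(counts.values()) + alpha * len(counts)
    return {state: (n + alpha) / total for state, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts for Smoking given one (Gender, Age) combination.
    print(dirichlet_posterior_mean({"Former": 1, "Current": 3, "Never": 6}))
```

One practical consequence is that no state receives probability zero, even when it was never observed for a given parent configuration.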
3.4. Cardiovascular risk model
Although the bnlearn package in R allows us to make inferences, in order to have a clear graphical representation the BN was implemented in Netica [58] from the structure and parameters obtained with bnlearn. The compiled network is represented in Fig. 2. The joint probability distribution of the BN in Fig. 2 requires the specification of 13 conditional probability tables, one for each variable conditioned on its parents' set.
As we can observe in Fig. 2, the CVLY and CVRS variables have a direct connection, and both are connected to the MetS variable through different trails. For example, the MetS variable is connected to the CVLY variable through the BP variable (BP is a common cause; once BP is instantiated the connection via this trail is broken), and the MetS variable is also connected to the CVRS variable through the TG variable (also a common cause; once the TG variable is instantiated the connection via this trail is broken). However, there are other possible trails such as: MetS ← HDL ← Gender → CVLY, MetS ← WC ← Gender → CVLY → CVRS, etc.
Fig. 2 – BN for the study of feature relationships to evaluate the CVLY, CVRS and MetS features. The BN shows optimal (46.8%) blood pressure (BP), normal (82.7%) triglycerides (TG), normal (87.2%) Glucose, normal weight (43.2%) (BMI), and practice (47.7%) and no practice (52.3%) of physical activity (PA). It also shows low levels of the Framingham-REGICOR score (CVRS) (91.8%), no metabolic syndrome (MetS) (88.3%) and similar cardiovascular lost years (CVLY) in the four quartiles.
The final BN obtained from the data set shows a high likelihood for the Low value of the CVRS variable, the No value of the MetS variable, the Normal value of the Glucose variable, the Normal value of the TG variable, the Normal value of the BP variable, the Normal value of the HDL variable, the Normal weight value of the BMI variable, the Normal value of the WC variable, and the Never value of the Smoking variable, and similar likelihoods for the different labels of the CVLY and PA variables.
3.5. Validation of the BN
The BN was validated using 10-fold cross-validation with a log-likelihood loss function, obtaining an expected loss of 9.3895. Table 3 shows the area under the ROC curve (AUC) and the percentage correctly classified for the different features.
3.6. Performance comparison
In order to provide reference benchmarks for how our BN classifies, we also report the classification performance (see Table 4) obtained by the widely used Naïve Bayes (NB), Tree Augmented Naïve Bayes (TAN), Multilayer Perceptron (MLP), and the C4.5 decision tree algorithms integrated in WEKA [59]. Only the diagnostic features (CVLY, CVRS, MetS) were considered as a comparative example. The performance of each classification model is evaluated using three statistical measures: accuracy, sensitivity and specificity.
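For reference, the three measures can be computed from the four cells of a binary confusion matrix as sketched below. The counts in the usage example are hypothetical; for the multi-state features the study presumably aggregates per-class values, and that aggregation is not reproduced here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts.
    print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```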
Learning a BN from data is a form of unsupervised learning, in the sense that the learner does not distinguish the class variable from the attribute variables in the data [60]. We compare our BN with several supervised learning algorithms: NB, TAN, MLP, and the C4.5 decision tree.
NB and TAN classifiers are special types of BN in which supervised learning is performed. NB is a probabilistic graphical classifier based on Bayes' theorem which makes very strong assumptions about the independence between the predictor variables. The NB model assumes that instances fall into one of a number of mutually exclusive classes, and it is the simplest BN classifier, where the predictive variables are assumed to be conditionally independent given the class. The performance of NB is surprisingly good, given that this assumption is unrealistic. The TAN classifier [60] extends the NB model with a tree-shaped graph across the predictor variables. The TAN model is similar to NB except that each predictor variable is allowed to depend on one other predictor variable in addition to the class. This model provides more information than the NB model, as it includes information about the relationships among the predictor variables. MLP is a feedforward artificial neural network model which consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. The C4.5 algorithm is a decision tree induction method developed by Quinlan [61].
Table 3 – AUCs and percentage correctly classified for the different features.
| Variable name | State | AUC | Accuracy (%) |
|---|---|---|---|
| Gender | Men | 0.9048 | 82.1938 |
| Gender | Women | 0.9047 | 82.1938 |
| Age | 35–44 | 0.6756 | 53.4435 |
| Age | 45–54 | 0.6088 | 53.4435 |
| Age | 55–64 | 0.7273 | 53.4435 |
| Smoking | Former smoker | 0.6864 | 73.3534 |
| Smoking | Current smoker | 0.8772 | 73.3534 |
| Smoking | Never smoker | 0.8117 | 73.3534 |
| PA | No practice | 0.8763 | 78.6126 |
| PA | Practice | 0.8773 | 78.6126 |
| BMI | Underweight | 0.8242 | 55.1966 |
| BMI | Normal weight | 0.8460 | 55.1966 |
| BMI | Overweight GI | 0.7110 | 55.1966 |
| BMI | Overweight GII | 0.7338 | 55.1966 |
| BMI | Obesity TI | 0.8654 | 55.1966 |
| BMI | Obesity TII | 0.8905 | 55.1966 |
| BMI | Obesity TIII | 0.8638 | 55.1966 |
| WC | High | 0.7487 | 73.0278 |
| WC | Normal | 0.8677 | 73.0278 |
| WC | Very high | 0.9150 | 73.0278 |
| BP | Normal | 0.7384 | 59.2787 |
| BP | Optimal | 0.8902 | 59.2787 |
| BP | Normal high | 0.7505 | 59.2787 |
| BP | Mild | 0.8453 | 59.2787 |
| BP | Moderate | 0.8805 | 59.2787 |
| BP | Serious | 0.9408 | 59.2787 |
| HDL | Normal | 0.7639 | 69.4465 |
| HDL | Low | 0.8762 | 69.4465 |
| HDL | High | 0.8806 | 69.4465 |
| CVLY | First quartile | 0.9188 | 63.9600 |
| CVLY | Second quartile | 0.7926 | 63.9600 |
| CVLY | Third quartile | 0.8238 | 63.9600 |
| CVLY | Fourth quartile | 0.9335 | 63.9600 |
| Glucose | High | 0.7274 | 87.1525 |
| Glucose | Normal | 0.7277 | 87.1525 |
| TG | Normal | 0.8523 | 84.5980 |
| TG | Limit | 0.7953 | 84.5980 |
| TG | Hyper | 0.8636 | 84.5980 |
| CVRS | Low | 0.8095 | 91.2597 |
| CVRS | Moderate | 0.8201 | 91.2597 |
| CVRS | High | 0.8067 | 91.2597 |
| MetS | Yes | 0.9836 | 96.4438 |
| MetS | No | 0.9835 | 96.4438 |
The major advantage of BNs is their ability to represent, and hence help understand, knowledge. Our BN model gives the best or tied-best classification performance on the three diagnostic features (see Table 4). Furthermore, its graphical representation is very informative.
4. Reasoning patterns
BNs are used to calculate new probabilities when new information is obtained [8]. Given the evidence E = e, our goal is to find the most likely assignment to the variables in U = complementary(E), see Eq. (3):
$$\mathrm{MAP}(U \mid e) = \arg\max_{u} P(u, e) \qquad (3)$$
Table 4 – Performance for the CVLY, CVRS, and MetS features, comparing our BN with the corresponding algorithms in 10-fold cross-validation experiments.
| Measure | Algorithm | CVLY | CVRS | MetS |
|---|---|---|---|---|
| Accuracy | Bayesian network | 63.9600 | 91.2597 | 96.4438 |
| Accuracy | Naïve Bayes | 59.0033 | 90.4833 | 95.4921 |
| Accuracy | Tree Augmented Naïve Bayes | 63.8900 | 91.2580 | 96.0690 |
| Accuracy | Multilayer perceptron | 61.9835 | 91.2596 | 96.2434 |
| Accuracy | Trees C4.5 | 62.1337 | 91.2597 | 95.4420 |
| Sensitivity | Bayesian network | 0.6392 | 0.9131 | 0.9901 |
| Sensitivity | Naïve Bayes | 0.5901 | 0.9050 | 0.9550 |
| Sensitivity | Tree Augmented Naïve Bayes | 0.6389 | 0.9126 | 0.9900 |
| Sensitivity | Multilayer perceptron | 0.6200 | 0.9130 | 0.9620 |
| Sensitivity | Trees C4.5 | 0.6210 | 0.9130 | 0.9544 |
| Specificity | Bayesian network | 0.8785 | 0.2874 | 0.7967 |
| Specificity | Naïve Bayes | 0.8610 | 0.2790 | 0.7920 |
| Specificity | Tree Augmented Naïve Bayes | 0.8784 | 0.0874 | 0.7565 |
| Specificity | Multilayer perceptron | 0.8720 | 0.0870 | 0.7670 |
| Specificity | Trees C4.5 | 0.8740 | 0.0870 | 0.7460 |
There are two main types of queries: (1) in a probability query, we compute the posterior distribution of a single variable given the evidence, i.e., P(X|e); (2) in a MAP query, we find the most likely joint assignment to the variables in U. In order to introduce evidence in the network we have selected three reasoning patterns: causal reasoning, evidential reasoning, and intercausal reasoning.
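Both query types can be illustrated by brute-force enumeration on a small fragment. The sketch below uses the Smoking table from Table 2 and asks, given Smoking as evidence, for the posterior over Gender (a probability query) and for the most likely joint assignment of Gender and Age (a MAP query). The marginals for Gender and Age are partly assumed (only Men = 44.0% and Age 55–64 = 14.6% are quoted in the text), and treating them as independent root nodes is an assumption of the example; real inference engines use far more efficient algorithms than enumeration.

```python
from itertools import product

# ASSUMED marginals (see lead-in); Gender and Age treated as independent roots.
P_GENDER = {"Men": 0.44, "Women": 0.56}
P_AGE = {"35-44": 0.50, "45-54": 0.354, "55-64": 0.146}

# P(Smoking | Gender, Age), copied from Table 2.
P_SMOKING = {
    ("Men", "35-44"): {"Former": 0.0668, "Current": 0.3636, "Never": 0.5695},
    ("Men", "45-54"): {"Former": 0.0845, "Current": 0.3825, "Never": 0.5329},
    ("Men", "55-64"): {"Former": 0.1122, "Current": 0.2852, "Never": 0.6026},
    ("Women", "35-44"): {"Former": 0.1139, "Current": 0.3231, "Never": 0.5630},
    ("Women", "45-54"): {"Former": 0.1415, "Current": 0.3371, "Never": 0.5206},
    ("Women", "55-64"): {"Former": 0.1348, "Current": 0.1311, "Never": 0.7341},
}


def joint(gender, age, smoking):
    """P(gender, age, smoking) by the chain rule."""
    return P_GENDER[gender] * P_AGE[age] * P_SMOKING[(gender, age)][smoking]


def probability_query(smoking):
    """P(Gender | Smoking = smoking): sum out Age, then normalize."""
    unnorm = {g: sum(joint(g, a, smoking) for a in P_AGE) for g in P_GENDER}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}


def map_query(smoking):
    """argmax over (Gender, Age) of P(gender, age, Smoking = smoking), as in Eq. (3)."""
    return max(product(P_GENDER, P_AGE), key=lambda ga: joint(ga[0], ga[1], smoking))


if __name__ == "__main__":
    print(probability_query("Current"))
    print(map_query("Current"))
```

Note that the MAP query maximizes the joint P(u, e) rather than the conditional; the two share the same maximizer because P(e) does not depend on u.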
4.1. Causal reasoning
Causal reasoning takes place when we predict effects from causes (and so we proceed from top to bottom). We instantiate one variable at each step. In step 1 the Gender variable is instantiated either to Men or Women, in step 2 the Smoking variable is instantiated to Current Smoker or Never Smoker, in step 3 the physical activity (PA) variable is instantiated to Practice or No Practice, in step 4 the Age variable is instantiated to 35–44, in step 5 the Age variable is instantiated to 45–54, and in step 6 the Age variable is instantiated to 55–64.
4.1.1. Analysis of the cardiovascular lost years CVLY variable
Fig. 3 summarizes how the different quartiles of the CVLY variable change at each step. Of the conditional variables (Smoking and physical activity), the one with the greatest influence on cardiovascular lost years (CVLY) is the smoking habit, giving two clear patterns: (1) When Smoking is in the Never state, Fig. 3 shows that the highest probability is achieved for the first quartile in Women, followed by the second and third quartiles in Men. Adding the physical activity (PA) variable in the Practice state shows a decrease in the probability of the fourth quartile in Men and Women, with lower values in Women, and also an increase in the probability of the first quartile in Women and of the second quartile in Men; and (2) When Smoking is in the Current state, Fig. 3 shows that the highest probability is achieved for the fourth quartile in Men, followed by the third quartile in Women. Fig. 3 also shows an improvement of the situation when physical activity is instantiated to Practice and when the youngest age group is considered, with Men being the group at highest risk.
Fig. 3 – Step-by-step instantiations to evaluate CVLY. The different steps are: step 1 = Gender, step 2 = Smoking, step 3 = PA, step 4 = (Age = 35–44), step 5 = (Age = 45–54), and step 6 = (Age = 55–64), where S = Smoking and PA = Physical Activity. The different steps are represented on the horizontal axis. The estimated probability for the CVLY variable, expressed as a percentage at the different quartiles, is shown on the vertical axis. M: Men, W: Women.
Fig. 4 – Step-by-step instantiations to evaluate the MetS feature. The different steps are: step 1 = Gender, step 2 = Smoking, step 3 = PA, step 4 = (Age = 35–44), step 5 = (Age = 45–54), and step 6 = (Age = 55–64), where S = Smoking and PA = Physical Activity. The different steps are represented on the horizontal axis. The estimated probability for the MetS variable, expressed as a percentage at the different values (yes, no), is shown on the vertical axis. M: Men, W: Women.
4.1.2. Studying metabolic syndrome MetS
From Fig. 4, we can differentiate two patterns according to whether or not the subjects practice physical activity (PA variable). When physical activity is instantiated to Practice, we obtain the highest probability in the No state for metabolic syndrome (MetS variable); the Smoking variable does not have any influence, and Women are the most favored group (with the highest probability for the MetS variable in the No state). However, when physical activity is instantiated to No Practice, we observe that the probability of the Yes state of metabolic syndrome (MetS variable) increases, and again the Smoking variable does not have any influence. The group with the highest risk of MetS = Yes is Men.
4.1.3. Studying cardiovascular risk score CVRS
From Fig. 5, when physical activity is instantiated to Practice we obtain similar probabilities for the CVRS variable regardless of whether or not the subject smokes. The same holds when physical activity is instantiated to No Practice.
4.2. Evidential reasoning
Queries in which we reason from effects to causes (from bottom to top) are instances of evidential reasoning or explanation.
The MetS and CVRS variables are instantiated to the values Yes and High, respectively. We observe how the probabilities of the different variables change. The CVLY variable increases its Fourth quartile value from 22.4% to 90%. The BP variable increases its Mild value from 16.1% to 44.5% and decreases its Optimal value from 46.8% to 3.28%. The TG variable achieves similar likelihoods for all its values: Normal, Limit and Hyper. The HDL variable increases its High value from 16.8% to 69.1%. The WC variable increases its Very High value from 26.6% to 68.8%. The BMI variable decreases its Normal Weight value from 43.2% to 11.1%, and increases the probability of Overweight GII from 19.9% to 28.4% and the probability of Obesity TI from 12.6% to 33.7%. The PA variable increases the probability of the No Practice value from 52.3% to 89.7%. The Glucose variable increases its High value from 12.8% to 31.5%. The Gender variable increases its Men value from 44.0% to 78.5%. The Age variable increases its 55–64 value from 14.6% to 50.7%. Fig. 6 shows the probability variations.
4.3. Conditional entropy
In Shannon's theory [62], the entropy of X is the lower bound on the average number of bits that are needed to encode values of X. Another way of viewing the entropy is as a measure of our uncertainty about the value of X, i.e., little uncertainty about X will produce a low entropy value.
Fig. 5 – Step-by-step instantiations to evaluate the CVRS feature: step 1 = Gender, step 2 = Smoking, step 3 = Physical Activity, step 4 = (Age = 35–44), step 5 = (Age = 45–54), and step 6 = (Age = 55–64).
Fig. 6 – Evidential reasoning. The metabolic syndrome MetS variable is instantiated to the No value and the CVRS variable is instantiated to the High value.
A natural question is what the cost of encoding X is if we are already encoding Y. The conditional entropy of X given Y is
$$H_P(X \mid Y) = E_P\left[\log \frac{1}{P(X \mid Y)}\right] = \sum_{x,y} P(x, y)\,\log \frac{1}{P(x \mid y)} \qquad (4)$$
which captures the additional cost (in terms of bits) of encoding X when we are already encoding Y. Note that the maximum value of probability in P(X|Y) implies the lowest entropy value.
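A small numerical sketch may help make the definition concrete. It computes H(X|Y) in bits from a joint table, using the expectation of log 1/P(x|y) under the joint distribution P(x, y); the function name and the two toy distributions in the usage example are hypothetical.

```python
from math import log2


def conditional_entropy(joint):
    """H(X|Y) in bits, from a dict {(x, y): P(x, y)}."""
    # Marginal P(y), needed for P(x|y) = P(x, y) / P(y).
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    h = 0.0
    for (_, y), p in joint.items():
        if p > 0.0:
            h += p * log2(p_y[y] / p)  # P(x, y) * log 1/P(x|y)
    return h


if __name__ == "__main__":
    independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    determined = {(0, 0): 0.5, (1, 1): 0.5}
    print(conditional_entropy(independent))  # Y tells us nothing about X
    print(conditional_entropy(determined))   # Y fixes X completely
```

When the evidence makes one state of X almost certain, the conditional distribution concentrates on that state and its entropy approaches zero, which is the situation sought in the maximization steps of Section 4.4.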
For the MetS, CVLY, and CVRS features we are interested in determining and ordering the state values of the conditioning features such that we obtain the maximum probability value in some states, which leads to the minimum conditional entropy.
4.4. Intercausal reasoning
When different causes of the same effect can interact, we speak of intercausal reasoning, which constitutes a very common pattern in human reasoning.
Furthermore, BNs are able to produce probability estimates; in this sense we are interested in knowing the features with the highest influence in maximizing the probability of MetS, CVLY, and CVRS in some of their states.
4.4.1. Minimizing conditioned entropy for MetS
We maximize the probability of the MetS feature in the Yes state. To achieve this, we consider the Markov blanket of the MetS variable, which is composed of the following four variables: WC, HDL, BP and TG. At each step we choose the variable and the state that induce the greatest increase in the conditional probability of the MetS variable in the Yes state. A summary is shown in Table 5 and Fig. 7. Given the Markov blanket of the MetS variable, the global Markov property states that the MetS variable is conditionally independent of any other variable.
Table 5 – Step-by-step instantiations leading to maximization of the probability of the MetS variable, where in the initial BN without evidence MetS = Yes reached a probability of 11.7%. The different values Serious, Moderate, Mild and NormalH for the BP variable gave the same probability value for the MetS variable.
| Step | Instantiated variable | Value | MetS = Yes |
|---|---|---|---|
| 1 | TG | Hyper | 48.7% |
| 2 | WC | VeryHigh | 85.4% |
| 3 | HDL | Low | 100% |
| 4 | BP | NormalH | 100% |
| 4 | BP | Mild | 100% |
| 4 | BP | Moderate | 100% |
| 4’ | BP | Serious | 100% |
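The step-by-step selection described here is a greedy search: at each step, try every remaining (variable, state) pair, keep the one that raises the target probability most, and stop when nothing improves it. The sketch below is a generic version of that idea, not the procedure as run in Netica. The inference routine is passed in as a function; `toy_target_prob` is a hypothetical stand-in with invented numbers, used only so the example runs, and does not reproduce the study's probabilities.

```python
def greedy_instantiation(target_prob, candidates):
    """Greedily add evidence that most increases P(target state | evidence).

    target_prob: callable taking an evidence dict and returning a probability.
    candidates:  dict variable -> list of admissible states.
    Returns the list of (variable, state, probability) steps taken.
    """
    evidence, steps = {}, []
    current = target_prob(evidence)
    remaining = dict(candidates)
    while remaining:
        var, state, prob = max(
            (
                (v, s, target_prob({**evidence, v: s}))
                for v, states in remaining.items()
                for s in states
            ),
            key=lambda t: t[2],
        )
        if prob <= current:
            break  # no remaining variable increases the target probability
        evidence[var] = state
        current = prob
        steps.append((var, state, prob))
        del remaining[var]
    return steps


def toy_target_prob(evidence):
    """HYPOTHETICAL stand-in for BN inference; the gains below are invented."""
    gain = {("TG", "Hyper"): 0.4, ("WC", "VeryHigh"): 0.3, ("HDL", "Low"): 0.2}
    return 0.1 + sum(gain.get(item, 0.0) for item in evidence.items())


if __name__ == "__main__":
    options = {
        "TG": ["Normal", "Hyper"],
        "WC": ["Normal", "VeryHigh"],
        "HDL": ["Normal", "Low"],
    }
    for step in greedy_instantiation(toy_target_prob, options):
        print(step)
```

The early-stop condition mirrors the way a variable is left out of a table once instantiating it no longer increases the probability of the target state.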
Next, we maximize the probability of the MetS variable in the No state. We choose at each step the variable and the state that most increase the probability of the MetS variable in the No state. A summary is shown in Table 6 and Fig. 8.
4.4.2. Minimizing conditioned entropy for CVLY
We maximize the probability of the CVLY feature in the First Quartile state. To achieve this, we consider the Markov blanket of the CVLY variable, which is composed of the following six variables: CVRS, Gender, Age, Smoking, BP, and HDL. At each step we choose the variable and the state that induce the greatest increase in the conditional probability of the CVLY variable in the First Quartile state. The Age feature has not been included because it does not increase the probability of CVLY in the First Quartile once the BP, HDL, Smoking, Gender and CVRS features are instantiated. A summary is shown in Table 7. Given the Markov blanket of the CVLY variable, the global Markov property states that the CVLY variable is conditionally independent of any other variable.
Fig. 7 – Intercausal reasoning: maximizing the MetS feature in the Yes state. We try to obtain the highest probability for MetS = Yes after introducing the following evidence: TG = Hyper, WC = VeryHigh, HDL = Low.
Table 6 – Step-by-step instantiations leading to maximization of the probability of the MetS variable, where in the initial BN without evidence MetS = No reached a probability of 88.3%. The maximum probability for the MetS feature in state No is achieved when BP = Normal, WC = Normal and TG = Normal.
| Step | Instantiated variable | Value | MetS = No |
|---|---|---|---|
| 1 | BP | Normal | 95.4% |
| 2 | WC | Normal | 99.1% |
| 3 | TG | Normal | 100% |
| 4 | HDL | Normal | 100% |
| 4 | HDL | Low | 100% |
| 4 | HDL | High | 100% |
Again, we maximize the probability of the CVLY feature being in the Second Quartile state. The order of features is: Smoking, BP, HDL, Gender, CVRS, and Age. This achieves a maximum probability of 68% for the CVLY feature in the Second Quartile state. A summary is shown in Table 8.
Table 7 – Step-by-step instantiations leading to maximization of the probability of the CVLY variable, where in the initial BN without evidence CVLY = First Quartile reached a probability of 28.2%.
Step  Instantiated variable  Value        CVLY = First Quartile
1     BP                     Optimal      51.2%
2     HDL                    Low          70.7%
3     Smoking                NeverSmoker  90.8%
4     Gender                 Women        91.6%
5     CVRS                   Low          91.7%
Table 8 – Step-by-step instantiations leading to maximization of the probability of the CVLY variable, where in the initial BN without evidence CVLY = Second Quartile reached a probability of 24.3%.
Step  Instantiated variable  Value        CVLY = Second Quartile
1     Smoking                NeverSmoker  29.2%
2     BP                     Normal       43.3%
3     HDL                    Normal       58.3%
4     Gender                 Men          64.2%
5     CVRS                   Low          65.1%
6     Age                    55–64        68.0%
Again, we maximize the probability of the CVLY feature being in the Third Quartile state. The order of features is: BP, HDL, Smoking, Gender, CVRS, and Age. This achieves a maximum probability of 79% for the CVLY feature in the Third Quartile state. A summary is shown in Table 9.
Finally, we maximize the probability of the CVLY feature being in the Fourth Quartile state. The order of features is: BP and Smoking. This achieves a maximum probability of 100% for the CVLY feature in the Fourth Quartile state. A summary is shown in Table 10.
Fig. 8 – Intercausal reasoning: maximizing the MetS feature in the No state. We try to obtain the highest probability for MetS = No after introducing the following evidence: BP = Normal, TG = Normal, WC = Normal.
Table 9 – Step-by-step instantiations leading to maximization of the probability of the CVLY variable, where in the initial BN without evidence CVLY = Third Quartile reached a probability of 25.1%.
Step  Instantiated variable  Value        CVLY = Third Quartile
1     BP                     Normal       34.1%
2     HDL                    High         49.2%
3     Smoking                NeverSmoker  70.1%
4     Gender                 Men          75.5%
5     CVRS                   Low          77.6%
6     Age                    45–54        79.0%
Table 10 – Step-by-step instantiations leading to maximization of the probability of the CVLY variable, where in the initial BN without evidence CVLY = Fourth Quartile reached a probability of 22.4%.
Step  Instantiated variable  Value          CVLY = Fourth Quartile
1     BP                     Serious        79.6%
2     Smoking                CurrentSmoker  100%
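The contribution of each instantiation can be read off the tables as the difference between consecutive probabilities. The short Python sketch below does this for the values reported in Table 7 (CVLY = First Quartile); the helper name `step_gains` is ours and the gains are simple arithmetic differences in percentage points, not a quantity computed by the BN software.

```python
def step_gains(prior, steps):
    """Percentage-point gain contributed by each successive instantiation."""
    gains, previous = [], prior
    for label, probability in steps:
        gains.append((label, round(probability - previous, 1)))
        previous = probability
    return gains


# Values reported in Table 7 (prior P(CVLY = First Quartile) = 28.2%).
table7 = [
    ("BP = Optimal", 51.2),
    ("HDL = Low", 70.7),
    ("Smoking = NeverSmoker", 90.8),
    ("Gender = Women", 91.6),
    ("CVRS = Low", 91.7),
]

gains = step_gains(28.2, table7)

if __name__ == "__main__":
    for label, gain in gains:
        print(f"{label:24s} +{gain:.1f} points")
```

For Table 7, the first three instantiations (BP, HDL and Smoking) account for 62.6 of the 63.5 percentage points gained, while Gender and CVRS add less than one point each.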
4.4.3. Minimizing conditioned entropy for CVRS
We maximize the probability of the CVRS feature being in the Low state. To achieve this, we consider the Markov blanket of the CVRS variable, which is composed of the following three variables: CVLY, Age, and Gender. At each step we choose the variable and the state that induce the greatest increase in the conditional probability of the CVRS variable being in the Low state. The order of features is: CVLY, Age, and Gender. This achieves a maximum probability of 100% for the CVRS feature in the Low state. A summary is shown in Table 11. Given the Markov blanket of the CVRS variable, the global Markov property states that the CVRS variable is conditionally independent of any other variable.
Again, we maximize the probability of the CVRS feature being in the Moderate state. The order of features is the same as in the previous case: CVLY, Age, and Gender. This achieves a maximum probability of 34.4% for the CVRS feature in the Moderate state. A summary is shown in Table 12.
Finally, we maximize the probability of the CVRS feature being in the High state. The order of features is the same as in the previous cases: CVLY, Age, and Gender. This achieves a maximum probability of 11.9% for the CVRS feature in the High state. A summary is shown in Table 13.
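Because the three CVRS states start from very different prior probabilities, the maxima reached are easier to compare as ratios to those priors. The sketch below computes this ratio from the values reported in the text and in the captions of Tables 11–13; the name `lift` is our own label for the ratio and does not appear in the original analysis.

```python
# (prior %, maximum % after instantiating CVLY, Age and Gender), as reported
# in the text and in the captions of Tables 11-13.
cvrs = {
    "Low": (91.8, 100.0),
    "Moderate": (6.83, 34.4),
    "High": (1.36, 11.9),
}

# Ratio of the maximized probability to the prior probability of each state.
lift = {state: maximum / prior for state, (prior, maximum) in cvrs.items()}

if __name__ == "__main__":
    for state, value in lift.items():
        print(f"CVRS = {state:8s} lift = {value:.2f}")
```

Read this way, the High state remains unlikely in absolute terms (11.9%), yet its probability increases by the largest factor relative to its prior.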
Table 11 – Step-by-step instantiations leading to maximization of the probability of the CVRS variable in the Low state, where in the initial BN without evidence CVRS = Low reached a probability of 91.8%. The maximum conditioned probability for the CVRS feature in state Low is achieved when CVLY = First Quartile, Age = 55–64 and Gender = Men.
Step  Instantiated variable  Value          CVRS = Low
1     CVLY                   FirstQuartile  97.5%
2     Age                    55–64          99.0%
3     Gender                 Men            100%
Table 12 – Step-by-step instantiations leading to maximization of the probability of the CVRS variable in the Moderate state, where in the initial BN without evidence CVRS = Moderate reached a probability of 6.83%. The maximum conditioned probability for the CVRS feature in state Moderate is achieved when CVLY = Fourth Quartile, Age = 55–64 and Gender = Men.
Step  Instantiated variable  Value           CVRS = Moderate
1     CVLY                   FourthQuartile  19.4%
2     Age                    55–64           30.5%
3     Gender                 Men             34.4%
5. Discussion
This study demonstrates the feasibility of BNs in epidemiological studies, particularly when data on cardiovascular risk factors are considered. BNs can be used for answering clinical questions based on unobserved evidence, since the probability distributions can be automatically updated, in an appealing way, when new patient information is added.
BNs allow us to establish the relationships between features through relationships of dependency and conditional independency. Given Gender, Age, BP and HDL, the CVLY and CVRS features are d-separated from the MetS feature: no active trail connecting them was found. Moreover, consider the local Markov property of a node: given the parents of the CVLY feature, which are Gender, Smoking, BP (blood pressure) and HDL (cholesterol), the CVLY feature is independent of its nondescendants. The CVLY feature is therefore independent of all other variables except the CVRS feature and, in particular, independent of the MetS feature. Similarly, given the BP, WC, HDL and TG features, the MetS feature is independent of the remaining features; in this case, as the MetS feature does not have any descendants, the BP, WC, HDL and TG features constitute its Markov blanket, and the global Markov property states that the MetS feature is conditionally independent of any other feature given its Markov blanket.
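The d-separation statement above can be checked mechanically. The Python sketch below uses the standard criterion based on the moralized ancestral graph. It rests on an assumption that should be kept in mind: the `PARENTS` dictionary contains only the arcs that can be inferred from the text (the parents of CVLY, CVRS and MetS), not the full 13-node network, so the result illustrates the technique rather than reproducing the analysis of the learned BN. The function names are ours.

```python
from itertools import combinations

# Partial DAG inferred from the text (assumption: the remaining arcs of the
# 13-node network are omitted). Keys are children, values are their parents.
PARENTS = {
    "CVLY": ["Gender", "Smoking", "BP", "HDL"],
    "CVRS": ["CVLY", "Age", "Gender"],
    "MetS": ["BP", "WC", "HDL", "TG"],
}


def ancestral_set(parents, nodes):
    """The given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(parents, x, y, given):
    """True if x and y are d-separated by `given` in the DAG."""
    given = set(given)
    keep = ancestral_set(parents, {x, y} | given)
    # Moralize the ancestral subgraph: link each child to its parents,
    # link parents that share a child, and drop arc directions.
    adjacent = {node: set() for node in keep}
    for child in keep:
        child_parents = list(parents.get(child, ()))
        for parent in child_parents:
            adjacent[child].add(parent)
            adjacent[parent].add(child)
        for a, b in combinations(child_parents, 2):
            adjacent[a].add(b)
            adjacent[b].add(a)
    # x and y are d-separated iff every path between them crosses `given`.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for neighbour in adjacent[node]:
            if neighbour not in seen and neighbour not in given:
                seen.add(neighbour)
                stack.append(neighbour)
    return True


if __name__ == "__main__":
    evidence = {"Gender", "Age", "BP", "HDL"}
    print(d_separated(PARENTS, "CVLY", "MetS", evidence))
    print(d_separated(PARENTS, "CVRS", "MetS", evidence))
    print(d_separated(PARENTS, "CVLY", "MetS", set()))
```

In this partial graph, CVLY and CVRS are both d-separated from MetS given Gender, Age, BP and HDL, whereas without evidence CVLY and MetS remain connected through their common parents BP and HDL.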
Table 13 – Step-by-step instantiations leading to maximization of the probability of the CVRS variable in the High state, where in the initial BN without evidence CVRS = High reached a probability of 1.36%. The maximum conditioned probability for the CVRS feature in state High is achieved when CVLY = Fourth Quartile, Age = 55–64 and Gender = Men.
Step  Instantiated variable  Value           CVRS = High
1     CVLY                   FourthQuartile  3.77%
2     Age                    55–64           9.41%
3     Gender                 Men             11.9%
Given the structure of a BN, applying the global Markov property to each feature allows us to establish the set of features (namely the Markov blanket of that specific feature) with the strongest influence on it; furthermore, the Markov blanket of a particular feature (a node in the DAG) can be used to find the combination of states that maximizes or minimizes a particular state of that feature. In this study we focus mainly on the CVLY, CVRS, and MetS features. However, the BN could be used to characterize the whole set of variables; e.g., the Markov blanket of the BMI feature is given by physical activity (PA), gender (Gender), and waist circumference (WC). Given these three features, the BMI feature is independent of the remaining ones; furthermore, the blanket could be used to find the combination of states that maximizes or minimizes a specific state of the BMI feature.
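The Markov blanket of a node consists of its parents, its children and the other parents of its children, so it can be read directly from the DAG. The following Python sketch computes it for the three diagnostic features. As an assumption, the `PARENTS` dictionary lists only the arcs that can be inferred from the text for CVLY, CVRS and MetS; the arcs around BMI, PA and the other features are not stated in the text and are therefore left out.

```python
# Partial DAG inferred from the text (assumption: arcs not described in the
# text, e.g. those involving BMI and PA, are omitted). child -> parents.
PARENTS = {
    "CVLY": ["Gender", "Smoking", "BP", "HDL"],
    "CVRS": ["CVLY", "Age", "Gender"],
    "MetS": ["BP", "WC", "HDL", "TG"],
}


def markov_blanket(parents, node):
    """Parents, children and co-parents (other parents of children) of node."""
    children = {child for child, ps in parents.items() if node in ps}
    co_parents = {p for child in children for p in parents[child]}
    return (set(parents.get(node, ())) | children | co_parents) - {node}


if __name__ == "__main__":
    for feature in ("CVLY", "CVRS", "MetS"):
        print(feature, sorted(markov_blanket(PARENTS, feature)))
```

The blankets obtained from these arcs agree with the ones used in the step-by-step instantiations: six variables for CVLY, three for CVRS and four for MetS.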
In our BN model Gender and BMI are connected. An association, or link, between gender and BMI has been widely shown in the literature. However, the reasons for the gender difference in BMI are unclear. Differences in anatomic, physiologic, metabolic and sex hormonal status between genders could contribute to these differences. In [63] and [64], based on a Swedish and a Canadian population respectively, Gender and BMI appear related. In [65], where the authors build a BN for predicting metabolic syndrome from a data set for epidemiological research on the Korean population, Gender appears completely isolated: it is related neither to BMI nor to any other variable (WC, Age, HDL, Cholesterol, etc.).
The BN model included the BMI and WC features. The most commonly used method for classifying an individual as overweight or obese is the body mass index (BMI). The BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², resulting from mass in kilograms and height in metres. However, the BMI has limitations and can lead to the misclassification of certain individuals, such as those with increased muscle mass or the elderly. Waist circumference (WC) may be a better indicator of health risk than BMI alone, especially when used in combination with BMI. WC is particularly useful for individuals with a BMI of 25–34. For individuals with a BMI of 35 or more, WC adds little predictive power to the disease risk classification of BMI. Recent studies have reported that WC, waist-to-hip ratio (WHR) and waist-to-height ratio (WHtR) correlate better with cardiovascular risk factors than BMI does (see for instance [66,67]).
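The BMI definition above amounts to a one-line computation. The Python sketch below is a minimal illustration; the cut-offs in `who_category` are the standard WHO adult thresholds (18.5, 25 and 30 kg/m²), which are an assumption on our part since they are not stated in this section, and the function names are ours.

```python
def bmi(mass_kg, height_m):
    """Body mass index in kg/m^2: body mass divided by the squared height."""
    if mass_kg <= 0 or height_m <= 0:
        raise ValueError("mass and height must be positive")
    return mass_kg / height_m ** 2


def who_category(value):
    """Standard WHO adult classes (assumed; not taken from this section)."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"


if __name__ == "__main__":
    value = bmi(80.0, 1.80)  # 80 kg and 1.80 m give about 24.7 kg/m^2
    print(f"BMI = {value:.1f} kg/m^2 ({who_category(value)})")
```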
The reasons for the sex difference in CVRS are not fully understood. Differences in major cardiovascular risk factors, particularly in HDL cholesterol level, obesity and smoking rate, explain a substantial part of the sex difference in cardiovascular risk [68,69].
The main difference with respect to other cardiovascular risk studies in the literature [65,70–74] is that we include three diagnostic features: CVLY, CVRS, and MetS. This helps to determine the features with the greatest influence on each of the diagnostic features.
In summary, a BN is a graph-based representation of a joint multivariate probability distribution which captures the way an expert establishes the relationships between variables. Furthermore, BNs are a powerful tool for modeling the decision-making process under uncertainty, combining a qualitative and a quantitative representation at the same time. Owing to their similar knowledge pattern, a BN (a modeling tool) can serve as an informal basis for the development of a Decision Support System (DSS) framework in the form of a tabular rule-based system [75] for medical recommendations (a DSS tool).
6. Conclusions
BNs have been chosen in order to produce an intuitive, transparent, graphical representation of the investigated interdependencies. The obtained model helps us to easily identify the relationships of probabilistic causal dependency and conditional independency between features. As a result, we can visualize the relationships between 13 features in the domain of cardiovascular risk. In this case, because CVD is multifactorial, the application of this kind of network is of special interest, from both a theoretical and a practical point of view.
Furthermore, the implemented BN was used to make inferences, i.e., to predict new scenarios when hypothetical information was introduced. Adding evidence, such as different CVRF values, to the implemented BN may be of great interest in epidemiological studies. For the BN analysis three reasoning patterns were considered: causal, evidential and intercausal reasoning. By combining the reasoning patterns with the local and global Markov properties and the concept of Markov blanket, some features were optimized.
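The three reasoning patterns can be illustrated on the smallest network that exhibits all of them, two causes A and B with a common effect C. The Python sketch below is a toy example: the variables and all probabilities are hypothetical and unrelated to the cardiovascular data, and inference is done by enumerating the joint distribution.

```python
from itertools import product

# Hypothetical toy network A -> C <- B with invented parameters.
P_A = 0.1  # prior probability of cause A
P_B = 0.1  # prior probability of cause B
P_C = {(0, 0): 0.01, (1, 0): 0.80, (0, 1): 0.80, (1, 1): 0.95}  # P(C=1 | A, B)


def joint(a, b, c):
    """Joint probability P(A=a, B=b, C=c) from the factorization of the DAG."""
    pa = P_A if a else 1.0 - P_A
    pb = P_B if b else 1.0 - P_B
    pc = P_C[(a, b)] if c else 1.0 - P_C[(a, b)]
    return pa * pb * pc


def prob(query, evidence=None):
    """P(query | evidence) by enumeration over the eight possible worlds."""
    evidence = evidence or {}
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        world = {"A": a, "B": b, "C": c}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den


causal = prob({"C": 1}, {"A": 1})               # from cause to effect
evidential = prob({"A": 1}, {"C": 1})           # from effect to cause
intercausal = prob({"A": 1}, {"C": 1, "B": 1})  # one cause explains away the other

if __name__ == "__main__":
    print(f"P(C=1)           = {prob({'C': 1}):.3f}")
    print(f"P(C=1 | A=1)     = {causal:.3f}")
    print(f"P(A=1 | C=1)     = {evidential:.3f}")
    print(f"P(A=1 | C=1,B=1) = {intercausal:.3f}")
```

Observing the cause raises the probability of the effect (causal), observing the effect raises the probability of the cause (evidential), and additionally observing the second cause lowers the probability of the first (intercausal).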
Acknowledgement
This research was funded by the Spanish Ministry of Science and Innovation (PI13/01477).
References
[1] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, Cambridge, MA/London, England, 2010.
[2] J. Pearl, Causality: Models, Reasoning and Inference, Cambridge University Press, Cambridge, 2000.
[3] P. Larranaga, S. Moral, Probabilistic graphical models in artificial intelligence, Appl. Soft Comput. 11 (2011) 1511–1528.
[4] A. Ligęza, P. Fuster-Parra, AND/OR/NOT causal graphs – a model for diagnostic reasoning, Int. J. Appl. Math. Comput. Sci. 7 (1997) 185–203.
[5] G.F. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Mach. Learn. 9 (1992) 309–347.
[6] D. Heckerman, D. Geiger, D.M. Chickering, Learning Bayesian networks: the combination of knowledge and statistical data, Mach. Learn. 20 (1995) 197–243.
[7] F. Liang, J. Zhang, Learning Bayesian networks for discrete data, Comput. Stat. Data Anal. 53 (2009) 865–876.
[8] C.J. Butz, S. Hua, J. Chen, H. Yao, A simple graphical approach for understanding probabilistic inference in Bayesian networks, Inf. Sci. 179 (2009) 699–716.
[9] C. Glymour, R. Scheines, P. Spirtes, K. Kelly, Discovering causal structure, Technical report CMU-PHIL-1, 1986.
[10] P. Spirtes, C. Glymour, R. Scheines, Causation, Prediction and Search, Adaptive Computation and Machine Learning, 2nd ed., The MIT Press, 2001.
[11] P. Fuster-Parra, A. García-Mas, F.J. Ponseti, P. Palou, J. Cruz, A Bayesian network to discover relationships between negative features in sport: a case study of teen players, Qual. Quant. 48 (2014) 1473–1491, http://dx.doi.org/10.1007/s11135-013-9848-y.
[12] P.P. Fuster-Parra, A. García-Mas, F.J. Ponseti, F.M. Leo, Team performance and collective efficacy in the dynamic psychology of competitive team: a Bayesian network analysis, Hum. Mov. Sci. 40 (2015) 98–118, http://dx.doi.org/10.1016/j.humov.2014.12.005.
[13] J. DeFelipe, P.L. López-Cruz, R. Benavides-Piccione, C. Bielza, P. Larranaga, et al., New insights into the classification and nomenclature of cortical GABAergic interneurons, Nat. Rev. Neurosci. 14 (2013) 202–216.
[14] M.B. Sesen, A.E. Nicholson, R. Banares-Alcantara, T. Kadir, M. Brady, Bayesian networks for clinical decision support in Lung Cancer Care, PLOS ONE 8 (2013) e82349, http://dx.doi.org/10.1371/journal.pone.0082349.
[15] A. Djebbari, J. Quackenbush, Seeded Bayesian networks: constructing genetic networks from microarray data, BMC Syst. Biol. (2008) 2–57, http://dx.doi.org/10.1186/1752-0509-2-57.
[16] C.J. Needham, J.R. Bradford, A.J. Bulpitt, et al., A primer on learning in Bayesian networks for computational biology, PLoS Comput. Biol. 3 (2007), http://dx.doi.org/10.1371/journal.pcbi.0030129.
[17] S.J. Lycett, M.J. Ward, F.I. Lewis, et al., Detection of mammalian virulence determinants in highly pathogenic avian influenza H5N1 viruses: multivariate analysis of published data, J. Virol. 83 (19) (2009) 9901–9910.
[18] A.F. Poon, F.I. Lewis, S.L. Pond, et al., Evolutionary interactions between N-linked glycosylation sites in the HIV-1 envelope, PLoS Comput. Biol. 3 (1) (2007), http://dx.doi.org/10.1371/journal.pcbi.0030011.
[19] R. Jansen, H. Yu, D. Greenbaum, et al., A Bayesian networks approach for predicting protein-protein interactions from genomic data, Science 302 (5644) (2003) 449–453.
[20] F.I. Lewis, F. Brülisauer, G.J. Gunn, Structure discovery in Bayesian networks: an analytical tool for analysing complex animal health data, Prev. Vet. Med. 100 (2) (2011) 109–115.
[21] F.I. Lewis, B.J. McCormick, Revealing the complexity of health determinants in resource-poor settings, Am. J. Epidemiol. 176 (11) (2012) 1051–1059.
[22] M. Lappenschaar, A. Hommerson, P.J.F. Lucas, J. Lagro, S. Visscher, Multilevel Bayesian networks for the analysis of hierarchical health care data, Artif. Intell. Med. 57 (2013) 171–183.
[23] P. Antal, G. Fannes, D. Timmerman, Y. Moreau, B.D. Moor, Bayesian applications of belief networks and multilayer perceptrons for ovarian tumor classification with rejection, Artif. Intell. Med. 29 (2003) 29–60.
[24] P. Antal, G. Fannes, D. Timmerman, Y. Moreau, B.D. Moor, Using literature and data to learn Bayesian networks as clinical models of ovarian tumors, Artif. Intell. Med. 30 (2004) 257–281.
[25] T. Charitos, L.C. Gaag, S. Visscher, K.A.M. Schurink, P.J.F. Lucas, A dynamic Bayesian network for diagnosing ventilator-associated pneumonia in ICU patients, Expert Syst. Appl. 36 (2009) 1249–1258.
[26] S.M. Maskery, H. Hu, J. Hooke, C.D. Shriver, M.N. Liebman, A Bayesian derived network of breast pathology co-occurrence, J. Biomed. Inform. 41 (2008) 242–250.
[27] X.H. Wang, B. Zheng, W.F. Good, J.L. King, Y.H. Chang, Computer assisted diagnosis of breast cancer using a data-driven Bayesian belief network, Int. J. Med. Inform. 54 (1999) 115–126.
[28] J.J. Cabre, F. Martin, B. Costa, J.L. Pinol, J.L. Llor, Y. Ortega, et al., Metabolic syndrome as a cardiovascular disease risk factor: patients evaluated in primary care, BMC Public Health 8 (2008) 251, http://dx.doi.org/10.1186/1471-2458-8-251.
[29] S.M. Grundy, J.I. Cleeman, S.R. Daniels, K.A. Donato, R.H. Eckel, B.A. Franklin, D.J. Gordon, R.M. Krauss, P.J. Savage, S.C. Smith Jr., J.A. Spertus, F. Costa, Diagnosis and management of the metabolic syndrome: an American Heart Association/National Heart, Lung, and Blood Institute Scientific Statement, Circulation 112 (2005) 2735–2752.
[30] J.G. Lee, S. Lee, Y.J. Kim, H.K. Jin, B.M. Cho, Y.J. Kim, et al., Multiple biomarkers and their relative contributions to identifying metabolic syndrome, Clin. Chim. Acta 408 (2009) 50–55.
[31] P. Tauler, M. Bennasar-Veny, J.M. Morales-Asencio, A.A. Lopez-Gonzalez, T. Vicente-Herrero, J. De Pedro-Gomez, V. Royo, J. Pericas-Beltran, A. Aguilo, Prevalence of premorbid metabolic syndrome in Spanish adult workers using IDF and ATPIII diagnostic criteria: relationships with cardiovascular risk factors, PLOS ONE 9 (2) (2014), http://dx.doi.org/10.1371/journal.pone.0089281.
[32] B. Van Steenkiste, T. Van der Weijden, H.E. Stoffers, A.D. Kester, D.R. Timmermans, R. Grol, Improving cardiovascular risk management: a randomized, controlled trial on the effect of a decision support tool for patients and physicians, Eur. J. Cardiovasc. Prev. Rehabil. 14 (1) (2007) 44–50.
[33] P.D. Sorlie, D.E. Bild, M.S. Lauer, Cardiovascular epidemiology in a changing world – challenges to investigators and the National Heart, Lung, and Blood Institute, Am. J. Epidemiol. 175 (7) (2012) 597–601.
[34] M. Franco, U. Bilal, E. Guallar, G. Sanz, A.F. Gómez, V. Fuster, R. Cooper, Systematic review of three decades of Spanish cardiovascular epidemiology: improving translation for a future of prevention, Eur. J. Prev. Cardiol. (2012), http://dx.doi.org/10.1177/2047487312455314.
[35] J. Marrugat, R. Elosua, H. Marti, Epidemiology of ischaemic heart disease in Spain: estimation of the number of cases and trends from 1997 to 2005, Rev. Esp. Cardiol. 55 (4) (2002) 337–346.
[36] A. Willis, M. Davies, T. Yates, K. Khunti, Primary prevention of cardiovascular disease using validated risk scores: a systematic review, J. R. Soc. Med. 105 (8) (2012) 348–356.
[37] F.H. Zimmerman, Cardiovascular disease and risk factors in law enforcement personnel: a comprehensive review, Cardiol. Rev. 20 (4) (2012) 159–166.
[38] R.B. D’Agostino, R.S. Vasan, M.J. Pencina, P.A. Wolf, M. Cobain, J.M. Massaro, W.B. Kannel, General cardiovascular risk profile for use in primary care: the Framingham heart study, Circulation 117 (6) (2008) 743–753.
[39] World Health Organization, Obesity: Preventing and Managing the Global Epidemic, WHO, Geneva, 1998.
[40] A.A. Lopez-Gonzalez, A. Aguilo, M. Frontera, M. Bennasar-Veny, I. Campos, T. Vicente-Herrero, M. Tomas-Salva, J. De Pedro-Gomez, P. Tauler, Effectiveness of the Heart Age tool for improving modifiable cardiovascular risk factors in a Southern European population: a randomized trial, Eur. J. Prev. Cardiol. 22 (3) (2015) 389–396, http://dx.doi.org/10.1177/2047487313518479.
[41] F.V. Jensen, T.D. Nielsen, Bayesian Networks and Decision Graphs, Information Science & Statistics, Springer, 2007.
[42] M. Marfell-Jones, T. Olds, A. Stewart, L. Carter, International Standards for Anthropometric Assessment, International Society for the Advancement of Kinanthropometry, Potchefstroom, South Africa, 2006.
[43] F. Buitrago, L. Canon-Barroso, N. Diaz-Herrera, E. Cruces-Muro, M. Escobar-Fernandez, J.M. Serrano-Arias, Comparison of the REGICOR and SCORE function charts for classifying cardiovascular risk and for selecting patients for hypolipidemic or antihypertensive treatment, Rev. Esp. Cardiol. 60 (2007) 139–147.
[44] M.R. Cobain, Assessment Heart Age, 2011, http://www.heartagecalculator.com.
[45] A. Soureti, R. Hurling, P. Murray, W. van Mechelen, M. Cobain, Evaluation of a cardiovascular disease risk assessment tool for the promotion of healthier lifestyles, Eur. J. Cardiovasc. Prev. Rehabil. 17 (2010) 519–523.
[46] W. Buntine, A guide to the literature on learning probabilistic networks from data, IEEE Trans. Knowl. Data Eng. 8 (2) (1996) 195–210, http://dx.doi.org/10.1109/69.494161.
[47] J. Cheng, R. Greiner, J. Kelly, D. Bell, W. Liu, Learning Bayesian networks from data: an information-theory based approach, Artif. Intell. 137 (2002) 43–90.
[48] L.E. Sucar, M. Martínez-Arroyo, Interactive structural learning of Bayesian networks, Expert Syst. Appl. 15 (1998) 325–332.
[49] R.W. Robinson, Counting unlabeled acyclic digraph, in: Little CHC, editor, Lecture notes in mathematics, 622, Combinatorial mathematics V, Springer-Verlag, New York, 1977, pp. 28–43.
[50] R. Daly, Q. Shen, S. Aitken, Learning Bayesian networks: approaches and issues, Knowl. Eng. Rev. 26 (2) (2011) 99–157.
[51] D. Margaritis, Learning Bayesian network model structure from data, 2003 (PhD Thesis of CMU-CS-03-153).
[52] R. Nagarajan, M. Scutari, S. Lèbre, Bayesian Networks in R: With Applications in Systems Biology, Springer, 2013.
[53] M. Scutari, Learning Bayesian networks with the bnlearn R package, J. Stat. Softw. 35 (3) (2010) 1–22.
[54] R Development Core Team, R: A Language and Environment for Statistical Computing, in: R Foundation for Statistical Computing, Vienna, Austria, 2012, ISBN: 3-900051-07-0, http://www.R-project.org/.
[55] S. Hojsgaard, D. Edwards, S. Lauritzen, Graphical Models with R, Springer, New York, 2012.
[56] G. Claeskens, N.L. Hjort, Model Selection and Model Averaging, Cambridge University Press, Cambridge, 2008.
[57] R.E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Inc., Upper Saddle River, NJ, USA, 2003.
[58] Norsys Software Corporation, Netica is a trademark of Norsys Software Corporation, 2012, Retrieved from: http://www.norsys.com, Copyright 1995–2012.
[59] Weka, 3.6.9: Waikato Environment for Knowledge Analysis, The University of Waikato, Hamilton, New Zealand, 2013.
[60] N. Friedman, D. Geiger, M. Goldszmidt, Bayesian network classifiers, Mach. Learn. 29 (1997) 131–163.
[61] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, 1993.
[62] C.E. Shannon, A mathematical theory of communication, Bell Labs. Tech. J. 27 (1948) 379–423, http://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.x.
[63] C. Li, G. Engström, B. Hedblad, S. Calling, G. Berglund, L. Janzon, Sex differences in the relationships between BMI, WHR and incidence of cardiovascular disease: a population-based cohort study, Int. J. Obes. 30 (2006) 1775–1781, http://dx.doi.org/10.1038/sj.ijo.0803339.
[64] D.R. McCreary, Gender and age differences in the relationships between body mass index and perceived weight: exploring the paradox, Int. J. Men’s Health 1 (1) (2002) 31–42.
[65] H.S. Park, S.B. Cho, Evolutionary attribute ordering in Bayesian networks for predicting the metabolic syndrome, Expert Syst. Appl. 39 (2012) 4240–4249.
[66] M. Bennasar-Veny, A.A. Lopez-Gonzalez, P. Tauler, M.L. Cespedes, T. Vicente-Herrero, et al., Body adiposity index and cardiovascular health risk factors in caucasians: a comparison with the body mass index and others, PLoS ONE 8 (5) (2013) e63999.
[67] M.B. Snijder, M. Nicolaou, I.G. van Valkengoed, L.M. Brewster, K. Stronks, Newly proposed body adiposity index (bai) by Bergman et al. is not strongly related to cardiovascular health risk, Obesity (Silver Spring) 20 (2012) 1138–1139.
[68] R.G. Baeza, V. Neira, C. Neira, M. Acevedo, Gender differences in cardiovascular risk by two different scores: a five years follow up analysis of a 1500-patient database, J. Am. Coll. Cardiol. 65 (10) (2015) A1502.
[69] A. Lopez-Gonzalez, et al., Desigualdades socioeconómicas y diferencias según sexo y edad en los factores de riesgo cardiovascular, Gaceta Sanitaria 29 (2015) 27–36.
[70] J. Vila-Francés, J. Sanchís, E. Soria-Olivas, A.J. Serrano, Expert system for predicting unstable angina based on Bayesian networks, Expert Syst. Appl. 40 (2013) 5004–5010.
[71] V.G. Almeida, J. Borba, H.C. Pereira, T. Pereira, C. Correia, M. Pêgo, J. Cardoso, Cardiovascular risk analysis by means of pulse morphology and clustering methodologies, Comput. Methods Prog. Biomed. 117 (2014) 257–266.
[72] Ch.R. Twardy, A.E. Nicholson, K.B. Korb, J. Mcneil, Epidemiological data mining cardiovascular Bayesian networks, e-J. Health Inform. 1 (1) (2006).
[73] S. Paredes, T. Rocha, P. de Carvalho, J. Henriques, M. Harris, J. Morais, Long term cardiovascular risk models’ combination. A new approach, Comput. Methods Prog. Biomed. 101 (3) (2009) 231–242.
[74] A. Elsayad, M. Fakr, Diagnosis of cardiovascular diseases with Bayesian classifiers, J. Comput. Sci. 11 (2) (2015) 274–282, http://dx.doi.org/10.3844/jcssp.2015.274.282.
[75] A. Ligęza, G.J. Nalepa, A study of methodological issues in design and development of rule-based systems: proposal of a new approach, Wires Data Min. Knowl. 1 (2) (2011) 117–137.