(1)

Stata: Linear Regression

2h

Hein Stigum

Presentation, data and programs at:

https://www.med.uio.no/helsam/forskning/aktuelt/arrangementer/andre/2021/stata-course-uio.html

(2)

INTRODUCTION

DAG

(3)

DAG: Gestational age and Birthweight

• Birth weight analysis

– Continuous outcome

– Plots by gestational age

– Compare means

– Linear regression

[Figure: DAG with exposure E (gestational age), outcome D (birth weight) and the covariates education and sex (labelled C1 and C2)]

(4)

Agenda

• Purpose

• Workflow

• Syntax

• Testing assumptions

• Influence

(5)

BACKGROUND

(6)

Regression idea

[Figure: scatter of birth weight (gram) against gestational age (days) with a fitted regression line]

Model: $y = b_0 + b_1 x_1 + e$

$y$ = outcome, $x_1$ = covariate, $b_1$ = coefficient (effect of $x_1$), $e$ = residual error

Model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$

$x_1, x_2$ = covariates

(7)

Model, measure and assumptions

• Model (standard)

$y = b_0 + b_1 x_1 + b_2 x_2 + e, \quad e \sim N(0, \sigma^2)$, with $x_1$ = exposure and $x_2$ = confounder

• Association measure

$b_1$ = change in $y$ for one unit increase in $x_1$

• Assumptions for the standard model (can relax the assumptions)

1. Independent residuals: discuss

2. No interactions: test

3. Linear effects: test

4. Constant residual variance: plot/test

• Influence
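As a worked step, using only the symbols of the standard model: holding $x_2$ fixed and raising $x_1$ by one unit changes the expected outcome by exactly $b_1$. This is the sense in which $b_1$ is the adjusted association measure. The derivation assumes the residual has mean zero.

```latex
\begin{align*}
E[y \mid x_1 + 1, x_2] - E[y \mid x_1, x_2]
  &= \bigl(b_0 + b_1 (x_1 + 1) + b_2 x_2\bigr) - \bigl(b_0 + b_1 x_1 + b_2 x_2\bigr) \\
  &= b_1
\end{align*}
```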

(8)

Purpose of regression

• Estimation

– Estimate association between exposure and outcome adjusted for other covariates

– Example: estimate the effect of smoking on lung cancer

– Focus: DAGs, bias, precision

• Prediction

– Use an estimated model to predict the outcome given covariates in a new dataset

– Example: predict air pollution by distance from roads

– Focus: predictive power, model fit, $R^2$

(9)

Outcome distributions by exposure

[Figure: two panels showing the outcome distribution for exposed and unexposed]

Panel 1: linear regression

Panel 2: log-transform and linear regression, or cutoff and logistic regression

(10)

Workflow

• DAG

[Figure: DAG with exposure E (gestational age), outcome D (birth weight) and the covariates education and sex (labelled C1 and C2)]

Confounders: education → adjust

Risk factors: sex → include

• Scatter- and density plots

• Bivariate analysis

• Regression

– Model estimation

– Test of assumptions

• Independent residuals

• No interactions

• Linear effects

• Constant error variance

– Influence

• Influence of outliers

(11)

Syntax:

“3 Linear Regression.do”

“Analysis”

(12)

Density and scatter plots

Scatter of birth weight by gestational age

Distribution of birth weight for low/high gestational age

8/8/22 H.S. 12

Scatter: look for deviations from linearity and outliers (points 62 and 370 are labelled in the plot)

Density: look for shift in mean, shift in shape (gest<40w versus gest≥40w)

[Figure: scatter and density plots of birth weight (gr)]

(13)

Bi-variate

• Weight by sex (continuous by binary)

– ttest bw, by(sex)    t-test

• Weight by education (continuous by categorical-3)

– oneway bw educ    one way anova

• Weight by gest. age (continuous by continuous)

– regress bw gest    regression

– ttest bw, by(gest2)    cut in 2, t-test
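The three bivariate comparisons can be tried on simulated stand-in data. This is only a sketch: the data-generating values, the seed and the 280-day cut-point for gest2 are assumptions made for illustration, not the course data.

```stata
* Simulated stand-in data (hypothetical values, not the course data)
clear
set seed 12345
set obs 500
generate sex  = runiform() < 0.5            // 0 = boy, 1 = girl
generate educ = 1 + floor(3*runiform())     // 1, 2, 3
generate gest = 280 + 12*rnormal()          // gestational age in days
generate bw   = -2000 + 20*gest - 100*sex + 50*educ + 400*rnormal()

ttest bw, by(sex)                           // continuous by binary
oneway bw educ, tabulate                    // continuous by 3 categories
regress bw gest                             // continuous by continuous
generate gest2 = (gest >= 280) if gest < .  // cut in 2 (assumed cut-point)
ttest bw, by(gest2)
```

The `if gest < .` qualifier keeps missing gestational ages missing in gest2 instead of coding them as 1.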

(14)

Bi-variate result

Birth weight in gr

(15)

Syntax

• Estimation

– regress y x1 x2 linear regression

– regress y c.age i.sex continuous age, categorical sex

– regress y c.age##i.sex main+interaction

• Compare models

– estimates store m1 save model

– estimates table m1 m2 compare coefficients

– estimates stats m1 m2 compare model fit

• Post estimation

– predict res, residuals    predict residuals in new variable “res”
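A minimal sketch of how these commands fit together, run on simulated data. The variables y, age and sex and all numeric values are hypothetical stand-ins chosen for illustration.

```stata
* Simulated stand-in data (hypothetical values)
clear
set seed 15
set obs 500
generate sex = runiform() < 0.5
generate age = 20 + 20*runiform()
generate y   = 10 + 0.5*age + 2*sex + 0.2*sex*age + rnormal()

regress y c.age i.sex               // main effects only
estimates store m1
regress y c.age##i.sex              // main effects + interaction
estimates store m2
estimates table m1 m2, b(%8.3f) se(%8.3f)   // compare coefficients
estimates stats m1 m2                        // compare model fit (AIC, BIC)

predict res, residuals              // residuals from the last model (m2)
predict fit, xb                     // fitted values from the same model
summarize res
```

Note that predict always refers to the most recently estimated (or restored) model, so the order of the commands matters.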

(16)

Factor (categorical) variables

• Variable

– educ = 1, 2, 3 for Low, Medium and High education

• Built in

– i.educ    use educ=1 as base (reference)

– ib3.educ    use educ=3 as base (reference)

– help fvvarlist    help for factor variables

• Manual “dummies”*

– educ=1 as base, make dummies for 2 and 3

– generate Medium =(educ==2) if educ<.

– generate High =(educ==3) if educ<.

*margins and contrast require i.var notation
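The built-in notation and the manual dummies give the same coefficients, which can be checked on simulated data. This is a sketch only: the data values and the stored-model names builtin and manual are made up for illustration.

```stata
* Simulated stand-in data (hypothetical values)
clear
set seed 16
set obs 500
generate educ = 1 + floor(3*runiform())     // 1 = Low, 2 = Medium, 3 = High
generate gest = 280 + 12*rnormal()
generate bw   = -2000 + 20*gest + 50*educ + 400*rnormal()

regress bw gest i.educ                      // built in, educ=1 as base
estimates store builtin

generate Medium = (educ == 2) if educ < .
generate High   = (educ == 3) if educ < .
regress bw gest Medium High                 // manual dummies, same model
estimates store manual

estimates table builtin manual, b(%8.2f)

regress bw gest ib3.educ                    // same fit, educ=3 as base
```

Changing the base with ib3. changes which contrasts are reported, not the fit of the model.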

(17)

Continuous variables

• Variable

– Gestational age ranging from 28 to 42 weeks (mode=40)

• Built in

– c.gest default except for interactions

• Advice

– Do not categorize continuous variables in a final analysis!

• Loss of power

• Increased measurement error

• Spurious interaction

– Whether exposure, confounder (or outcome)

– Need methods for non-linear effects (polynomials, splines)

(18)

Syntax

“Regression analysis”

(19)

Model 1: outcome+exposure

regress bw gest crude model

estimates store m1 store model results

(20)

Model 2 and 3: Add covariates

regress bw gest i.educ sex    add covariates

estimates table m1 m2 m3    compare coefs

Estimate association: m1 is biased, m2=m3

m3 more precise? m2: se(gest)=4.3, m3: se(gest)=4.2

Conclusion:

m1 is biased, m2 and m3 are unbiased, but m3 is more precise
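The pattern in the conclusion can be reproduced on simulated data in which education affects both gestational age and birth weight (a confounder) and sex affects birth weight only (a risk factor). This is a sketch: every number below is an assumption chosen for illustration, and the exact commands behind m2 and m3 in the course do-file may differ.

```stata
* Simulated stand-in data (hypothetical values); true gest effect = 20 gr/day
clear
set seed 20
set obs 2000
generate educ = 1 + floor(3*runiform())
generate sex  = runiform() < 0.5
generate gest = 270 + 5*educ + 10*rnormal()     // educ -> gest
generate bw   = -2000 + 20*gest + 150*educ - 300*sex + 400*rnormal()

regress bw gest                   // crude: confounded by educ
estimates store m1
regress bw gest i.educ            // adjusted for the confounder
estimates store m2
regress bw gest i.educ sex        // risk factor added for precision
estimates store m3

estimates table m1 m2 m3, keep(gest) b(%8.2f) se(%8.2f)
```

Adding the risk factor removes residual variation in the outcome, so se(gest) shrinks while the coefficient itself stays close to the m2 estimate.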

(21)

INFLUENCE

Measures of influence

WOULD NORMALLY HANDLE ASSUMPTIONS FIRST

(22)

Influence idea (different data)

[Figure: scatter of birth weight (gr) with one outlier and two fitted regression lines]

Regression without outlier: beta = 17.1

Regression with outlier: beta = 10.5

delta beta = -3.5

delta beta*se = -6.8

(23)

Measures of influence

• Measure change in:

– Coefficients (beta)

• Delta beta (scaled by se(coeff))

Remove obs 1, see change; remove obs 2, see change; and so on

[Figure: influence (delta-beta) plotted against observation Id]

One delta-beta per observation (per covariate, the exposure)
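In symbols, the idea can be sketched as below. The notation $\hat b_{1(i)}$ for the exposure coefficient estimated with observation $i$ removed is introduced here for illustration, and the exact standard error used for the scaling depends on the software.

```latex
\Delta\beta_i \;=\; \frac{\hat b_1 - \hat b_{1(i)}}{\operatorname{se}(\hat b_1)},
\qquad i = 1, \dots, N
```

A value of $\Delta\beta_i = -2$ therefore means that including observation $i$ pulls the estimate down by about two standard errors.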

(24)

Syntax:

“Influence of outliers”

(25)

[Figure: delta-beta for gestational age (Dfbeta gest) plotted against the identifier, 1-N; observations 370 and 62 stand out]

dfbeta gest    create delta-beta

scatter _dfbeta_1 id    plot vs id-variable

Note: delta-beta is variable specific

If obs nr 370 is removed, beta will change by 2 se’s = 2*4.2 ≈ 8 gr

(26)

Removing outliers

regress bw gest i.educ sex if id!=370
est store drop1

regress bw gest i.educ sex if id!=370 & id!=62
est store drop2

est table full drop1 drop2, b(%8.0f)

Conclusion:

Outlier 370 had a large effect
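The whole influence workflow can be sketched on simulated data with one planted outlier. The data values, the position of the outlier (observation 370, chosen to echo the slide) and the helper variable absdf are assumptions made for illustration.

```stata
* Simulated stand-in data (hypothetical values) with one planted outlier
clear
set seed 370
set obs 500
generate id   = _n
generate sex  = runiform() < 0.5
generate educ = 1 + floor(3*runiform())
generate gest = 280 + 12*rnormal()
generate bw   = -2000 + 20*gest - 100*sex + 50*educ + 400*rnormal()
replace gest = 250 in 370          // short gestation ...
replace bw   = 6000 in 370         // ... with an implausibly high weight

regress bw gest i.educ sex
estimates store full
dfbeta gest                        // creates _dfbeta_1
scatter _dfbeta_1 id, mlabel(id)   // plot delta-beta versus id

generate absdf = abs(_dfbeta_1)
gsort -absdf
list id gest bw _dfbeta_1 in 1/3   // most influential observations

regress bw gest i.educ sex if id != 370
estimates store drop1
estimates table full drop1, keep(gest) b(%8.1f) se(%8.1f)
```

Whether an influential observation should be dropped is a subject-matter decision; the comparison only shows how much the estimate depends on it.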

(27)

ASSUMPTIONS

(28)

Assumptions of the standard model

1. Independent residuals: discuss

2. No interactions: test in model

3. Linear effects: add splines

4. Constant residual variance: plot, test

Dependent residuals?

When will the birth weight of one child depend on the birth weight of another?

Siblings, twins

(29)

If violations of assumptions

• Dependent residuals

Use the option , vce(cluster var) or mixed models

• Interactions

Add interaction term

• Non-linear effects

Add polynomial or spline

• Non-constant variance

Use robust variance estimation

regress y x, robust

[Figures: a non-linear effect plotted against gest, and a residual (res) plot]
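For dependent residuals, a sketch with simulated sibling pairs shows the two options named on this slide. The cluster variable mother, the shared family effect u and all numeric values are hypothetical.

```stata
* Simulated sibling pairs (hypothetical values): 200 mothers, 2 children each
clear
set seed 2022
set obs 200
generate mother = _n
generate u = 200*rnormal()          // family effect shared by siblings
expand 2                            // two children per mother
generate gest = 280 + 12*rnormal()
generate bw   = -2000 + 20*gest + u + 350*rnormal()

regress bw gest                         // ignores the dependence
regress bw gest, vce(cluster mother)    // cluster-robust standard errors
mixed bw gest || mother:                // random intercept per mother
```

The coefficient for gest is identical in the first two models; only the standard error changes.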

(30)

INTERACTION

ONLY LINEAR EFFECTS

(31)

Interaction definitions

• Interaction: combined effect of two variables

• Scale

– Linear models: additive

• $y = b_0 + b_1 x_1 + b_2 x_2$, effect of both $x_1$ and $x_2$ = $b_1 + b_2$

– Logistic, Poisson, Cox: multiplicative

• effect of both $x_1$ and $x_2$ = $OR_1 \cdot OR_2$

• Interaction

– deviation from additivity (or multiplicativity)

– effect of $x_1$ depends on $x_2$

(32)

Syntax

“Interaction”

(33)

Interaction (only linear effects)

• Add interaction terms

regress bw c.gest##i.sex i.educ    main + gest-sex interaction

• Show results

margins, dydx(gest) at(sex=0)    effect of gest for boys

margins, dydx(gest) at(sex=1)    effect of gest for girls
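A sketch on simulated data shows how the margins results relate to the coefficients: the slope for sex=0 is the main effect of gest, and the slope for sex=1 is the main effect plus the interaction term. The data values are hypothetical, and the lincom line is an alternative added here, not part of the slide.

```stata
* Simulated stand-in data (hypothetical values); slope 20 for boys, 24 for girls
clear
set seed 33
set obs 1000
generate sex  = runiform() < 0.5            // 0 = boy, 1 = girl
generate educ = 1 + floor(3*runiform())
generate gest = 280 + 12*rnormal()
generate bw   = -2000 + 20*gest - 1200*sex + 4*sex*gest + 50*educ + 400*rnormal()

regress bw c.gest##i.sex i.educ             // main + gest-sex interaction
margins, dydx(gest) at(sex=0)               // effect of gest for boys
margins, dydx(gest) at(sex=1)               // effect of gest for girls
lincom gest + 1.sex#c.gest                  // same slope for girls, by hand
```

The test of the interaction itself is the row 1.sex#c.gest in the regression table.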

(34)

NON-LINEAR EFFECTS

(35)

Smoothers in regressions

• Polynomials

– $x, x^2, x^3$

• Splines

– cubic: only plots

– linear: estimates

• Fractional polynomials (2 of 8)

– $x^{-2}, x^{-1}, x^{-0.5}, \log(x), x^{0.5}, x, x^2, x^3$

[Figure: spline curve with knots marked, labels c1 and c2]

(36)

Syntax

“Linear effect”

(37)

Cubic spline

• Cubic spline

mkspline c=gest, cubic nknots(4)    make spline with 4 knots (c1,c2,c3)

regress bw c1 c2 c3 i.educ sex    regression with spline

est store cs    store estimates as cs

• Plot

gen igest=round(gest)    integer values of gest

margins, over(igest)    predicted bw by gest

marginsplot    plot

• Test

est stats best cs    AIC: better fit

[Figure: Predictive Margins with 95% CIs, linear prediction against gestational age, 27 to 41 weeks]
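The cubic spline steps can be sketched end to end on simulated data with a built-in non-linear effect. All data values are assumptions for illustration, and the stored name lin for the linear model is hypothetical (the slide compares against a model stored as best).

```stata
* Simulated stand-in data (hypothetical values), gest in weeks
clear
set seed 37
set obs 1000
generate gest = 27 + 15*runiform()
generate bw   = 1000 + 100*(gest-27) + 250*max(gest-32,0) ///
                - 300*max(gest-38,0) + 300*rnormal()

regress bw gest                     // linear effect
estimates store lin

mkspline c = gest, cubic nknots(4)  // restricted cubic spline: c1 c2 c3
regress bw c1 c2 c3                 // regression with spline
estimates store cs
estimates stats lin cs              // lower AIC = better fit

generate igest = round(gest)        // integer values of gest
margins, over(igest)                // predicted bw by gest
marginsplot
```

The individual spline coefficients have no simple interpretation, which is why the cubic spline is read from the plot.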

(38)

Cubic spline with given knots

• Cubic spline

mkspline c=gest, cubic knots(30 32 38 40)

regress bw c1 c2 c3 i.educ sex    regression with spline

[Figure: Predictive Margins with 95% CIs, linear prediction against gestational age, 27 to 41 weeks]

Better fit at low gest

(39)

Linear spline

• Linear spline

mkspline l1 32 l2 38 l3=gest    linear spline with knots at 32 and 38

regress bw l1 l2 l3 i.educ sex    regression with spline

est store ls    store estimates as ls

• Plot (as before)

• Test (as before): best fit
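With a linear spline each coefficient is the slope within its own segment, which is what makes the estimates reportable. A sketch on simulated data follows; the true slopes of 100, 350 and 50 gr per week and every other value are assumptions for illustration.

```stata
* Simulated stand-in data (hypothetical values), gest in weeks
clear
set seed 39
set obs 1000
generate gest = 27 + 15*runiform()
generate bw   = 1000 + 100*(gest-27) + 250*max(gest-32,0) ///
                - 300*max(gest-38,0) + 300*rnormal()

mkspline l1 32 l2 38 l3 = gest      // knots at 32 and 38 weeks
regress bw l1 l2 l3                 // l1, l2, l3: slope in each segment
estimates store ls
lincom l2 - l1                      // change in slope at 32 weeks

* Alternative coding: the marginal option makes the coefficients
* the change in slope from the preceding segment
mkspline d1 32 d2 38 d3 = gest, marginal
regress bw d1 d2 d3
```

Both codings give the same fitted line; they differ only in what the coefficients mean.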

(40)

Summing up: non-linear effects

• Capture non-linearities in continuous variable

– Categorize, lose precision

– Fractional polynomials or splines are better

• Continuous exposure

– Replace by cubic spline: good fit, only plot

– Replace by linear spline: good fit, estimates

• Continuous confounder

– Keep linear (unless non-linear in both exposure and outcome effect)

(41)

CONSTANT RESIDUAL

VARIANCE

(42)

Test constant residual variance

• Constant variance:

• Plot

rvfplot    plot residual versus predicted (fitted)

• Test

estat hettest

Result: some heteroscedasticity

or, compare se-s with and without “robust”
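The plot, the test and the comparison of standard errors can be sketched on simulated data in which the residual spread grows with gestational age. The data values and the stored names ordinary and robust are hypothetical.

```stata
* Simulated stand-in data (hypothetical values) with non-constant variance
clear
set seed 42
set obs 500
generate gest = 250 + 60*runiform()
generate bw   = -2000 + 20*gest + (100 + 10*(gest-250))*rnormal()

regress bw gest
estimates store ordinary
rvfplot, yline(0)                   // residual versus fitted
estat hettest                       // test of constant variance

regress bw gest, robust             // robust variance estimation
estimates store robust
estimates table ordinary robust, b(%8.2f) se(%8.2f)   // compare se-s
```

The coefficients are the same in both models; a clear difference in se-s is another sign of heteroscedasticity.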

(43)

Syntax :

“Constant residual variance”

(44)

Final model

Linear spline model with robust variance estimation:

regress bw g1 g2 g3 i.educ sex, robust
est store lsr

Conclusion:

At 27-32 weeks the birth weight increases by 104 gr per week.

At 32-38 weeks the birth weight increases by 345 gr per week.

[Output table: estimates and se]

(45)

Correct model for effect of education

• Interpret other covariate effects from the model?

[Figure: two DAGs of gest, educ and bw, with tables for the models "final" and "educMod"]

Exposure gest (model "final"): educ is a confounder, adjust

Exposure educ (model "educMod"): gest is a mediator, do not adjust

Conclusion:

Effect of education is misleading in the final model.

Need a separate model for each covariate

(46)

Help

• Linear regression

– help regress

• syntax and options

– help regress postestimation

• dfbeta

• estat hettest

• rvfplot

• predict

• margins

– help factor variables

• factor variables and interactions

(47)

Summing up 1: Model fitting

• Build model

– regress bw gest crude model

– est store m1 store

– regress bw gest i.educ sex    full model

– est store m2

– est table m1 m2 compare coefficients

(48)

Summing up 2: Assumptions

• Independent residuals

• No interaction

– regress bw c.gest##i.sex i.educ    test interaction

– margins, dydx(gest) at(sex=0)    gest for boys

• Linear effects

– mkspline g1 38 g2=gest    linear spline

– regress bw g1 g2 i.sex i.educ estimate splines

• Constant residual variance

– rvfplot residual versus fitted

– regress …, robust robust variance

(49)

Summing up 3: Influence of outliers

• Influence

– dfbeta gest    delta-beta

– scatter _dfbeta_1 id plot versus id

(50)

References

• Westreich, D. and S. Greenland (2013). "The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients." American Journal of Epidemiology 177(4): 292-298.

• Robinson, L. D. and N. P. Jewell (1991). "Some Surprising Results About Covariate Adjustment in Logistic-Regression Models." International Statistical Review 59(2): 227-240.

• Xing, C. and G. A. Xing (2010). "Adjusting for Covariates in Logistic Regression Models." Genetic Epidemiology 34(8): 937-937.

• Royston, P., D. G. Altman and W. Sauerbrei (2006). "Dichotomizing continuous predictors in multiple regression: a bad idea." Stat Med 25(1): 127-141.

• Binder, H., W. Sauerbrei and P. Royston (2013). "Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response." Stat Med 32(13): 2262-2277.

• Govindarajulu, U. S., E. J. Malloy, B. Ganguli, D. Spiegelman and E. A. Eisen (2009). "The Comparison of Alternative Smoothing Methods for Fitting Non-Linear Exposure-Response Relationships with Cox Models in a Simulation Study." International Journal of Biostatistics 5(1).

• Kahan, B. C., H. Rushton, T. P. Morris and R. M. Daniel (2016). "A comparison of methods to adjust for continuous covariates in the analysis of randomised

(51)

EXTRA MATERIAL

(52)

Test deviation from linearity

regress y x1 x2    linear term

estimates store lin

regress y f(x1) x2    smoother term, f(x1)=poly or spline

estimates store smo

estimates stats lin smo    AIC (or lrtest lin smo for an LR-test)
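A concrete version of this scheme, with a restricted cubic spline as the smoother f(x1), can be sketched on simulated data. The linear model is nested in the spline model because the first spline variable equals the original variable, so a likelihood-ratio test applies. All data values are assumptions for illustration.

```stata
* Simulated stand-in data (hypothetical values), non-linear effect of gest
clear
set seed 52
set obs 1000
generate sex  = runiform() < 0.5
generate gest = 27 + 15*runiform()
generate bw   = 1000 + 100*(gest-27) + 250*max(gest-32,0) ///
                - 300*max(gest-38,0) - 100*sex + 300*rnormal()

regress bw gest sex                 // linear term
estimates store lin

mkspline c = gest, cubic nknots(4)  // smoother: c1 (= gest), c2, c3
regress bw c1 c2 c3 sex             // smoother term
estimates store smo

lrtest lin smo                      // LR-test of deviation from linearity
```

A small p-value indicates that the smoother fits better than the straight line.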

(53)

Table 1

Outcome: Birth weight

Exposure: Gestational age

Covariates:

(54)

Table 2

Do not show coefficients from cofactors,

they may be misleading
