Stata:
Linear Regression
2h
Hein Stigum
Presentation, data and programs at:
https://www.med.uio.no/helsam/forskning/aktuelt/arrangementer/andre/2021/stata-course-uio.html
INTRODUCTION
DAG
DAG: Gestational age and Birthweight
• Birth weight analysis
– Continuous outcome
– Plots by gestational age
– Compare means
– Linear regression
[Figure: DAG with exposure E = gestational age, outcome D = birth weight, and covariates C1 and C2 (education, sex)]
Agenda
• Purpose
• Workflow
• Syntax
• Testing assumptions
• Influence
BACKGROUND
Regression idea
[Figure: birth weight (gram) plotted against gestational age (days)]
Model: $y = b_0 + b_1 x_1 + e$
$y$ = outcome
$x_1$ = covariate
$b_1$ = coefficient, effect of $x_1$
$e$ = residual error
Model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
$x_1, x_2$ = covariates
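A minimal sketch of the regression idea in Stata, on simulated data. The data, the seed and the true values b0 = -1500 and b1 = 18 are assumptions made up for illustration; they are not the course data.

```stata
// Simulated data: all numbers below are assumptions for illustration
clear
set seed 2021
set obs 1000
generate gest = 250 + 60*runiform()                 // gestational age (days)
generate bw   = -1500 + 18*gest + rnormal(0, 400)   // y = b0 + b1*x + e

regress bw gest                                     // estimate b0 and b1
display "b1 = " _b[gest] "  se = " _se[gest]

predict e, residuals                                // e = y - (b0 + b1*x)
summarize e                                         // residuals average to zero
```

The estimated coefficient on gest should land close to the value used in the simulation, and the residuals have mean zero by construction of least squares.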
Model, measure and assumptions
• Model (standard)
$y = b_0 + b_1 x_1 + b_2 x_2 + e$, $e \sim N(0, v)$
($x_1$ = exposure, $x_2$ = confounder)
• Association measure
$b_1$ = change in y for one unit increase in $x_1$
• Assumptions for the standard model (can relax the assumptions)
1. Independent residuals    discuss
2. No interactions    test
3. Linear effects    test
4. Constant residual variance    plot/test
• Influence
Purpose of regression
• Estimation
– Estimate association between exposure and outcome adjusted for other covariates
– Estimate the effect of smoking on lung cancer
• Prediction
– Use an estimated model to predict the
outcome given covariates in a new dataset
– Predict air pollution by distance from roads
Estimation: DAGs, bias, precision
Prediction: predictive power, model fit, $R^2$
Outcome distributions by exposure
[Figure: two panels of outcome distributions for exposed and unexposed. Panel annotations: "Linear regression" and "Linear regression or log-transform, cutoff, logistic regression"]
Workflow
• DAG
[Figure: DAG with exposure E = gestational age, outcome D = birth weight, and covariates C1 and C2 (education, sex)]
Confounders: education    adjust
Risk factors: sex    include
• Scatter- and density plots
• Bivariate analysis
• Regression
– Model estimation
– Test of assumptions
• Independent residuals
• No interactions
• Linear effects
• Constant error variance
– Influence
• Influence of outliers
Syntax:
“3 Linear Regression.do”
“Analysis”
Density and scatter plots
Scatter of birth weight by gestational age
Distribution of birth weight for low/high gestational age
8/8/22 H.S. 12
Scatter: look for deviations from linearity and outliers (observations 62 and 370 are marked)
Density: look for shift in mean, shift in shape (gest<40w versus gest≥40w)
[Figure: scatter and density plots, axis: Birth weight (gr)]
Bi-variate
• Weight by sex (continuous by binary)
– ttest bw, by(sex)    t-test
• Weight by education (continuous by categorical-3)
– anova bw educ    one way anova
• Weight by gest. age (continuous by continuous)
– regress bw gest    regression
– ttest bw, by(gest2)    cut in 2, t-test
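The three bivariate comparisons can be sketched on simulated data as follows. The variable names follow the slides (bw, sex, educ, gest, gest2); the data themselves and the cut at 40 weeks for gest2 are assumptions for illustration. The sketch uses oneway for the one way anova.

```stata
// Simulated data: all numbers below are assumptions for illustration
clear
set seed 2021
set obs 600
generate sex   = runiform() < 0.5                   // 0/1
generate educ  = 1 + floor(3*runiform())            // 1, 2, 3
generate gest  = 39 + rnormal(0, 1.5)               // weeks
generate bw    = 3500 + 150*(gest - 39) - 120*sex + 60*educ + rnormal(0, 400)
generate gest2 = gest >= 40 if gest < .             // gest cut in 2

ttest bw, by(sex)               // continuous by binary: t-test
oneway bw educ, tabulate        // continuous by categorical: one way anova
regress bw gest                 // continuous by continuous: regression
ttest bw, by(gest2)             // continuous by dichotomized gest: t-test
```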
Bi-variate result
Birth weight in gr
Syntax
• Estimation
– regress y x1 x2    linear regression
– regress y c.age i.sex    continuous age, categorical sex
– regress y c.age##i.sex    main+interaction
• Compare models
– estimates store m1    save model
– estimates table m1 m2    compare coefficients
– estimates stats m1 m2    compare model fit
• Post estimation
– predict res, residuals    store residuals in new variable “res”
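A short sketch of the estimate, store, compare and predict sequence. It uses the auto example dataset that ships with Stata rather than the course data, so the variables price, weight and foreign stand in for y, age and sex.

```stata
sysuse auto, clear                      // example data shipped with Stata

regress price c.weight i.foreign        // main effects only
estimates store m1

regress price c.weight##i.foreign       // main effects + interaction
estimates store m2

estimates table m1 m2, b(%9.2f) se      // compare coefficients
estimates stats m1 m2                   // compare model fit (AIC, BIC)

predict res, residuals                  // residuals from the active model (m2)
```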
Factor (categorical) variables
• Variable
– educ = 1, 2, 3 for Low, Medium and High education
• Built in
– i.educ    use educ=1 as base (reference)
– ib3.educ    use educ=3 as base (reference)
– help fvvarlist    help for factor variables
• Manual “dummies”*
– educ=1 as base, make dummies for 2 and 3
– generate Medium =(educ==2) if educ<.
– generate High =(educ==3) if educ<.
*margins and contrast require i.var notation
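A small sketch, on simulated data, showing that i.educ and manual dummies give the same coefficients and that ib3.educ only changes the reference level. The data and effect sizes are assumptions for illustration.

```stata
// Simulated data: all numbers below are assumptions for illustration
clear
set seed 2021
set obs 600
generate educ = 1 + floor(3*runiform())          // 1, 2, 3
replace  educ = . in 1/10                        // a few missing values
generate bw   = 3400 + 80*(educ==2) + 150*(educ==3) + rnormal(0, 400)

regress bw i.educ                                // built in, educ=1 as base
scalar b_high_fv = _b[3.educ]

generate Medium = (educ==2) if educ < .          // manual dummies
generate High   = (educ==3) if educ < .          // "if educ<." keeps missing as missing
regress bw Medium High
scalar b_high_dummy = _b[High]

display "High vs Low: " b_high_fv "  " b_high_dummy
regress bw ib3.educ                              // same model, educ=3 as base
```

Without the `if educ<.` qualifier the dummies would be 0 for observations with missing education, which would silently move those observations into the reference group.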
Continuous variables
• Variable
– Gestational age ranging from 28 to 42 weeks (mode=40)
• Built in
– c.gest    continuous is the default, except in interactions
• Advice
– Do not categorize continuous variables
in a final analysis!
• Loss of power
• Increased measurement error
• Spurious interaction
– Whether exposure, confounder (or outcome)
– Need methods for non-linear effects (polynomials, splines)
Syntax
“Regression analysis”
Model 1: outcome+exposure
regress bw gest    crude model
estimates store m1    store model results
Model 2 and 3: Add covariates
regress bw gest i.educ sex    add covariates
estimates table m1 m2 m3    compare coefs
Estimate association: m1 is biased, m2=m3
m3 more precise? m2: se(gest)=4.3, m3: se(gest)=4.2
Conclusion:
m1 is biased, m2 and m3 are unbiased, but m3 is more precise
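A sketch of the three-model comparison on simulated data. The slides show only one of the covariate-adjusted commands, so the definition of m2 as "exposure plus the confounder education" is an assumption based on the DAG, as are the data and all effect sizes.

```stata
// Simulated data: all numbers below are assumptions for illustration
clear
set seed 2021
set obs 2000
generate sex  = runiform() < 0.5
generate educ = 1 + floor(3*runiform())
generate gest = 37 + 0.6*educ + rnormal(0, 1.5)       // education affects gest
generate bw   = -2500 + 150*gest + 200*educ - 300*sex + rnormal(0, 400)

regress bw gest                   // m1: crude
estimates store m1
scalar b_m1 = _b[gest]

regress bw gest i.educ            // m2: adjust for the confounder (assumed definition)
estimates store m2
scalar se_m2 = _se[gest]

regress bw gest i.educ sex        // m3: also include the risk factor
estimates store m3
scalar b_m3  = _b[gest]
scalar se_m3 = _se[gest]

estimates table m1 m2 m3, keep(gest) b(%8.1f) se(%8.1f)
```

In this setup the crude coefficient is pulled away from the simulated value of 150 by confounding, while adding the risk factor sex leaves the coefficient roughly unchanged and reduces its standard error.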
INFLUENCE
Measures of influence
WOULD NORMALLY HANDLE ASSUMPTIONS FIRST
Influence idea (different data)
[Figure: scatter of birth weight (gr) with one outlier and two fitted lines. Labels: outlier; regression without outlier; regression with outlier. Values shown: 17.1, 1, 0.5, -3.5; delta beta*se=-6.8]
Measures of influence
• Measure change in:
– Coefficients (beta)
• Delta beta (scaled by se(coeff))
Remove obs 1, see change; remove obs 2, see change; ...
[Figure: Influence (delta-beta) plotted against Id, observations 1, 2, ..., 10]
One delta-beta per observation (per covariate, here the exposure)
Syntax:
“Influence of outliers”
dfbeta gest    create delta-beta
scatter _dfbeta_1 id    plot vs id-variable
Note: variable specific
[Figure: Delta-beta for gestational age (Dfbeta gest) plotted against identifier, 1-N; observations 370 and 62 stand out]
If obs nr 370 is removed, beta will change 2 se’s = 2*4.2≈8 gr
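A sketch of the delta-beta workflow on simulated data with one planted outlier (observation 200). The data are assumptions for illustration, and the cutoff 2/sqrt(N) used in the list command is a common rule of thumb, not something taken from the slides.

```stata
// Simulated data with one planted outlier: assumptions for illustration
clear
set seed 2021
set obs 200
generate id   = _n
generate gest = 39 + rnormal(0, 1.5)
generate bw   = 3500 + 150*(gest - 39) + rnormal(0, 400)
replace gest = 32   in 200          // early gestational age ...
replace bw   = 5500 in 200          // ... with a very high birth weight

regress bw gest
dfbeta gest                         // creates _dfbeta_1, in units of se(b)
scatter _dfbeta_1 id, mlabel(id)    // one delta-beta per observation
list id _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(e(N))   // rule-of-thumb cutoff
```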
Removing outliers
regress bw gest i.educ sex if id!=370
est store drop1
regress bw gest i.educ sex if id!=370 & id!=62
est store drop2
est table full drop1 drop2, b(%8.0f)
Conclusion:
Outlier 370 had a large effect
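A sketch of refitting without an influential observation and comparing the stored models, again on simulated data with a planted outlier (id 200). The data and the stored-model names full and drop1 in this sketch are assumptions for illustration.

```stata
// Simulated data with one planted outlier: assumptions for illustration
clear
set seed 2021
set obs 200
generate id   = _n
generate gest = 39 + rnormal(0, 1.5)
generate bw   = 3500 + 150*(gest - 39) + rnormal(0, 400)
replace gest = 32   in 200
replace bw   = 5500 in 200

regress bw gest                        // all observations
estimates store full
scalar b_full  = _b[gest]
scalar se_full = _se[gest]

regress bw gest if id != 200           // drop the outlier
estimates store drop1
scalar b_drop = _b[gest]

estimates table full drop1, b(%8.1f) se(%8.1f)
display "change in beta, in se units: " (b_full - b_drop)/se_full
```

The displayed change is close to, but not identical with, the dfbeta value, since dfbeta scales by a standard error computed without the observation.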
ASSUMPTIONS
Assumptions of the standard model
1. Independent residuals    discuss
2. No interactions    test in model
3. Linear effects    add splines
4. Constant residual variance    plot, test
Dependent residuals?
When will the birth weight of one child depend on the birth weight of another?
Siblings, twins
If violations of assumptions
• Dependent residuals
Add , vce(cluster var) or use mixed models
• Interactions
Add interaction term
• Non linear effects
Add polynomial or spline
• Non-constant variance
Use robust variance estimation
regress y x, robust
[Figures: illustration plots against gest and of residuals (res)]
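A sketch of the variance options for dependent residuals, on simulated sibling data. The cluster variable mother, the two-children-per-mother design and all numbers are hypothetical and only serve the illustration.

```stata
// Simulated sibling data: all names and numbers are assumptions
clear
set seed 2021
set obs 300                              // 300 mothers
generate mother = _n
generate u = rnormal(0, 300)             // shared family effect
expand 2                                 // two children per mother
generate gest = 39 + rnormal(0, 1.5)
generate bw   = 3500 + 150*(gest - 39) + u + rnormal(0, 300)

regress bw gest                          // ignores the dependence
regress bw gest, robust                  // robust to non-constant variance only
regress bw gest, vce(cluster mother)     // robust to dependence within mother
mixed bw gest || mother:                 // mixed model with random intercept
```

The coefficient on gest is identical in the three regress calls; only the standard errors differ.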
INTERACTION
ONLY LINEAR EFFECTS
Interaction definitions
• Interaction: combined effect of two variables
• Scale
– Linear models: additive
• $y = b_0 + b_1 x_1 + b_2 x_2$
• effect of both $x_1$ and $x_2$ = $b_1 + b_2$
– Logistic, Poisson, Cox: multiplicative
• effect of both $x_1$ and $x_2$ = $OR_1 \cdot OR_2$
• Interaction
– deviation from additivity (or multiplicativity)
– effect of $x_1$ depends on $x_2$
Syntax
“Interaction”
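As a worked version of "deviation from additivity", the linear model can be written with a product term. The coefficient $b_3$ is notation introduced here for illustration (it does not appear on the slides), and $x_1$, $x_2$ are taken to be binary.

```latex
\begin{align*}
y &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + e \\
\text{effect of } x_1 \text{ when } x_2 = 0 &: \quad b_1 \\
\text{effect of } x_1 \text{ when } x_2 = 1 &: \quad b_1 + b_3 \\
\text{effect of both } x_1 \text{ and } x_2 &: \quad b_1 + b_2 + b_3
\end{align*}
```

With $b_3 = 0$ the combined effect reduces to $b_1 + b_2$, the additive case; a non-zero $b_3$ is the interaction.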
Interaction (only linear effects)
• Add interaction terms
• Show results
regress bw c.gest##i.sex i.educ    main + gest-sex interaction
margins, dydx(gest) at(sex=0)    effect of gest for boys
margins, dydx(gest) at(sex=1)    effect of gest for girls
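A sketch of the interaction model on simulated data, where the slope of gest is set to differ between sex=0 and sex=1. The data and effect sizes are assumptions for illustration. The lincom and testparm lines are additions that show the same sex-specific slope and a test of the interaction term.

```stata
// Simulated data: all numbers below are assumptions for illustration
clear
set seed 2021
set obs 2000
generate sex  = runiform() < 0.5                 // 0 = boys, 1 = girls
generate educ = 1 + floor(3*runiform())
generate gest = 39 + rnormal(0, 1.5)
generate bw   = -2300 + (150 - 40*sex)*gest + 1400*sex + 50*educ + rnormal(0, 400)

regress bw c.gest##i.sex i.educ                  // main + gest-sex interaction
margins, dydx(gest) at(sex=(0 1))                // slope of gest for boys and girls
lincom gest + 1.sex#c.gest                       // slope for girls, by hand
testparm i.sex#c.gest                            // test of no interaction
```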
NON-LINEAR EFFECTS
Smoothers in regressions
• Polynomials
– $x$, $x^2$, $x^3$
• Splines
– cubic    only plots
– linear    estimates
[Figure: spline with knots c1 and c2]
• Fractional polynomials (2 of 8)
– $x^{-2}$, $x^{-1}$, $x^{-0.5}$, $\log(x)$, $x^{0.5}$, $x$, $x^2$, $x^3$
Syntax
“Linear effect”
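The simplest smoother, a polynomial, can be sketched with factor-variable notation. The data are simulated with a curved effect of gest; all numbers are assumptions for illustration.

```stata
// Simulated data with a curved effect of gest: assumptions for illustration
clear
set seed 2021
set obs 1500
generate sex  = runiform() < 0.5
generate educ = 1 + floor(3*runiform())
generate gest = 39 + rnormal(0, 1.5)
generate bw   = 3400 + 150*(gest - 39) - 25*(gest - 39)^2 + rnormal(0, 400)

regress bw c.gest##c.gest i.educ sex     // gest and gest squared
testparm c.gest#c.gest                   // test of the squared term
margins, at(gest=(35(1)42))              // predicted bw along gest
```

Because the squared term is written as c.gest#c.gest, margins knows that gest and its square move together when computing predictions.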
Cubic spline
• Cubic spline
mkspline c=gest, cubic nknots(4)    make spline with 4 knots (c1,c2,c3)
regress bw c1 c2 c3 i.educ sex    regression with spline
est store cs    store estimates as cs
• Plot
gen igest=round(gest)    integer values of gest
margins, over(igest)    predicted bw by gest
marginsplot    plot
• Test
est stats best cs    AIC: better fit
[Figure: Predictive Margins with 95% CIs, linear prediction by gestational age (27-41 weeks)]
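A sketch of the cubic spline steps on simulated data with a non-linear effect of gest. The data, the shape of the true curve and the stored-model name lin for the linear comparison model are assumptions for illustration.

```stata
// Simulated data with a non-linear effect of gest: assumptions for illustration
clear
set seed 2021
set obs 1500
generate sex  = runiform() < 0.5
generate educ = 1 + floor(3*runiform())
generate gest = 27 + 14*rbeta(3, 1.5)            // weeks, skewed toward term
generate bw   = 800 + 100*(gest - 27) + 240*max(gest - 32, 0) ///
                - 300*max(gest - 38, 0) + rnormal(0, 400)

regress bw gest i.educ sex                       // linear effect of gest
estimates store lin

mkspline c = gest, cubic nknots(4)               // creates c1 c2 c3 (c1 is gest itself)
regress bw c1 c2 c3 i.educ sex                   // regression with spline
estimates store cs
estimates stats lin cs                           // lower AIC = better fit
testparm c2 c3                                   // test of the non-linear terms

generate igest = round(gest)                     // integer values of gest
margins, over(igest)                             // predicted bw by gest
marginsplot
```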
Cubic spline with given knots
• Cubic spline
mkspline c=gest, cubic knots(30 32 38 40)
regress bw c1 c2 c3 i.educ sex    regression with spline
[Figure: Predictive Margins with 95% CIs, linear prediction by gestational age (27-41 weeks)]
Better fit at low gest
Linear spline
• Linear spline
mkspline l1 32 l2 38 l3=gest    linear spline with knots at 32 and 38
regress bw l1 l2 l3 i.educ sex    regression with spline
est store ls    store estimates as ls
• Plot (as before)
• Test (as before): best fit
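A sketch of the linear spline on simulated data where the true slope is 100 gr per week below 32 weeks, 340 between 32 and 38 weeks, and 40 above 38 weeks. These slopes and the rest of the data are assumptions for illustration.

```stata
// Simulated data with piecewise linear effect of gest: assumptions for illustration
clear
set seed 2021
set obs 1500
generate sex  = runiform() < 0.5
generate educ = 1 + floor(3*runiform())
generate gest = 27 + 14*rbeta(3, 1.5)
generate bw   = 800 + 100*(gest - 27) + 240*max(gest - 32, 0) ///
                - 300*max(gest - 38, 0) + rnormal(0, 400)

mkspline l1 32 l2 38 l3 = gest           // linear spline, knots at 32 and 38
regress bw l1 l2 l3 i.educ sex           // each coefficient is a slope per week
estimates store ls
lincom l2 - l1                           // change in slope at 32 weeks
```

Each coefficient is the slope within its own interval, which is why the linear spline gives interpretable estimates where the cubic spline gives only a plot.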
Summing up: non-linear effects
• Capture non-linearities in continuous variable
– Categorize, lose precision
– Fractional polynomials or splines are better
• Continuous exposure
– Replace by cubic spline: good fit, only plot
– Replace by linear spline: good fit, estimates
• Continuous confounder
– Keep linear (unless non-linear in both exposure and outcome effect)
CONSTANT RESIDUAL VARIANCE
Test constant residual variance
• Constant variance:
• Plot
rvfplot    plot residual versus predicted (fitted)
some heteroscedasticity
• Test
estat hettest
or, compare se-s with and without “robust”
Syntax :
“Constant residual variance”
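A sketch of the plot, the test and the robust comparison on simulated data where the residual spread grows with gest. The data are assumptions for illustration.

```stata
// Simulated heteroscedastic data: assumptions for illustration
clear
set seed 2021
set obs 1000
generate gest = 35 + 7*runiform()
generate bw   = 3500 + 150*(gest - 39) + rnormal(0, 1)*(150 + 100*(gest - 35))

regress bw gest
scalar se_ols = _se[gest]
rvfplot, yline(0)                        // residual versus fitted: look for a fan shape
estat hettest                            // test of constant variance

regress bw gest, robust                  // same coefficients, robust se
display "se without robust: " se_ols "   with robust: " _se[gest]
```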
Final model
Linear spline model with robust variance estimation:
regress bw g1 g2 g3 i.educ sex, robust
est store lsr
Conclusion:
At 27-32 weeks the birth weight increases by 104 gr per week
At 32-38 weeks the birth weight increases by 345 gr per week
[Table: estimates and se by model]
Correct model for effect of education
• Interpret other covariate effects from the model?
[Figure: two DAGs of gest, educ and bw. Exposure gest: educ is a confounder, adjust. Exposure educ: gest is a mediator, do not adjust]
[Table: gest and educ estimates in the models final and educMod]
Conclusion:
Effect of education is misleading in the final model.
Need a separate model for each covariate
Tables, DAGs
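A sketch of why the education coefficient differs between the two models, on simulated data where education affects birth weight both directly and through gestational age. The data and effect sizes are assumptions for illustration; the model names final and educMod follow the slide.

```stata
// Simulated data, educ -> gest -> bw and educ -> bw: assumptions for illustration
clear
set seed 2021
set obs 2000
generate sex  = runiform() < 0.5
generate educ = 1 + floor(3*runiform())
generate gest = 37 + 0.6*educ + rnormal(0, 1.5)       // gest is a mediator for educ
generate bw   = -2500 + 150*gest + 100*educ - 150*sex + rnormal(0, 400)

regress bw gest i.educ sex            // model for exposure gest (educ adjusted)
estimates store final
scalar b_educ3_final = _b[3.educ]

regress bw i.educ sex                 // model for exposure educ (mediator not adjusted)
estimates store educMod
scalar b_educ3_total = _b[3.educ]

estimates table final educMod, b(%8.1f) se(%8.1f)
```

The education coefficient in the gest model captures only the part of the education effect that does not pass through gestational age, so reading it as the effect of education would understate the total effect in this setup.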
Help
• Linear regression
– help regress
• syntax and options
– help regress postestimation
• dfbeta
• estat hettest
• rvfplot
• predict
• margins
– help factor variables
• factor variables and interactions
Summing up 1: Model fitting
• Build model
– regress bw gest crude model
– est store m1 store
– regress bw gest i.educ sex    full model
– est store m2
– est table m1 m2 compare coefficients
Summing up 2: Assumptions
• Independent residuals
• No interaction
– regress bw c.gest##i.sex i.educ    test interaction
– margins, dydx(gest) at(sex=0)    gest for boys
• Linear effects
– mkspline g1 38 g2=gest    linear spline
– regress bw g1 g2 i.sex i.educ    estimate splines
• Constant residual variance
– rvfplot residual versus fitted
– regress …, robust robust variance
Summing up 3: Influence of outliers
• Influence
– dfbeta gest    delta-beta
– scatter _dfbeta_1 id plot versus id
References
• Westreich, D. and S. Greenland (2013). "The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients." American Journal of Epidemiology 177(4): 292-298.
• Robinson, L. D. and N. P. Jewell (1991). "Some Surprising Results About Covariate Adjustment in Logistic-Regression Models." International Statistical Review 59(2): 227-240.
• Xing, C. and G. A. Xing (2010). "Adjusting for Covariates in Logistic Regression Models." Genetic Epidemiology 34(8): 937-937.
• Royston, P., D. G. Altman and W. Sauerbrei (2006). "Dichotomizing continuous predictors in multiple regression: a bad idea." Stat Med 25(1): 127-141.
• Binder, H., W. Sauerbrei and P. Royston (2013). "Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response." Stat Med 32(13): 2262-2277.
• Govindarajulu, U. S., E. J. Malloy, B. Ganguli, D. Spiegelman and E. A. Eisen (2009). "The Comparison of Alternative Smoothing Methods for Fitting Non-Linear Exposure-Response Relationships with Cox Models in a Simulation Study." International Journal of Biostatistics 5(1).
• Kahan, B. C., H. Rushton, T. P. Morris and R. M. Daniel (2016). "A comparison of methods to adjust for continuous covariates in the analysis of randomised