Stata:
Logistic Regression
3 h
Hein Stigum
Presentation, data and programs at:
https://www.med.uio.no/helsam/forskning/aktuelt/arrangementer/andre/2022/stata-course-uio.html
DAG: Physical activity and CHD
• CHD analysis
– Binary outcome
– Plots by physical activity
– Compare proportions
– Logistic regression
Agenda
• Purpose
• Workflow
• Syntax
• Testing assumptions
• Influence of outliers
BACKGROUND
Logistic model and assumptions
• Logistic model
• Assumptions of the standard model
– Independent residuals
– Linear effects (on the log-odds scale)
– No interactions
• Linear predictor: xb
Association measure, Odds ratio
Model:
Start with:
Hence:
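A sketch of the standard derivation behind these three steps, assuming a single binary exposure x. The symbols p (probability of the outcome), b_0 and b_1 are notation chosen here for illustration:

```latex
\begin{align*}
\text{Model:}      \quad & \ln\frac{p}{1-p} = b_0 + b_1 x \\
\text{Start with:} \quad & \mathrm{odds}(x) = \frac{p}{1-p} = e^{\,b_0 + b_1 x} \\
\text{Hence:}      \quad & OR = \frac{\mathrm{odds}(x=1)}{\mathrm{odds}(x=0)}
                           = \frac{e^{\,b_0 + b_1}}{e^{\,b_0}} = e^{\,b_1}
\end{align*}
```

With two exposures and no interaction term, the same steps give an odds ratio of e^{b_1 + b_2} = OR_1 * OR_2 for being exposed to both, which is why the model is multiplicative on the odds scale.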
Short: need to know
• Binary outcome
• Assume
– Linear effects on the log-odds scale
• Association measure
– OR = e^b, b = coefficient
• Scale
– Multiplicative: exposed to both x1 and x2: OR1*OR2
Purpose of regression
• Estimation
– Estimate effect of exposure on outcome adjusted for other covariates
– Estimate the effect of smoking on lung cancer
• Prediction
– Predict outcome by exposures
1. Estimate model (air pollution and distance from roads)
2. Predict air pollution in a new dataset using distance from roads
Estimation: DAGs, bias, precision
Prediction: predictive power, model fit, R^2
Syntax
• Estimation
– logistic y x1 x2                    logistic regression
– logistic y i.smoke c.age            categorical smoke, continuous age
– logistic y i.smoke##c.age           interaction, 3 terms
• Manage models
– estimates store m1                  save model
– est table m1, eform                 show OR
• Post estimation
– predict yf, pr                      predict probability
– margins, over(ageI) predict(xb)     linearity on the log odds scale
Workflow
• DAG
– Confounders: age, educ (adjust)
– Risk factors: sex (include*)
* Including sex gives a sex-specific estimate; otherwise a population estimate
• Bivariate analysis
• Regression
– Model fitting
• Exposure
• + Confounders
– Test of assumptions
• Independent errors
• Linear effects (on the log odds scale)
• Interactions
– Influence of outliers
(Daniel et al. 2020)
Syntax
“Descriptive Analysis”
Physical activity and CHD, example
Risk difference: 21 pp lower risk
Risk ratio: 0.22 times the risk
Odds ratio: 0.17 times the odds
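The three measures describe the same pair of risks. As a rough check, write p_0 and p_1 for the risk of CHD in the less active and more active groups (symbols introduced here). The risks below are back-calculated from the rounded figures on the slide, so they are approximate and not taken from the data:

```latex
\begin{align*}
p_1 - p_0 &= -0.21, \qquad \frac{p_1}{p_0} = 0.22
  \;\Rightarrow\; p_0 = \frac{0.21}{1 - 0.22} \approx 0.27, \quad p_1 \approx 0.06 \\[4pt]
OR &= \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
    \approx \frac{0.06/0.94}{0.27/0.73} \approx 0.17
\end{align*}
```

The odds ratio lies further from 1 than the risk ratio because the outcome is not rare in the less active group.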
Syntax
“Regression Analysis”
ASSUMPTIONS
Assumptions of the standard model
1. Independent residuals: discuss
2. Linear effects on the log-odds scale: add splines
3. No interactions: add interactions
Dependent residuals?
When will the heart disease of one person depend on the heart disease of another?
• Siblings, twins
• If many siblings:
logistic …, vce(cluster m_id)     clusters by mother’s id
or use mixed models
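A minimal sketch of the two options on simulated sibling data. Only m_id comes from the slide; the family effect u, the sample sizes and the coefficients are invented for illustration, and `melogit` assumes Stata 13 or later.

```stata
* Simulated families (assumption: all numbers invented)
clear
set seed 2022
set obs 500                                   // 500 mothers
generate m_id = _n
generate u = rnormal(0, 1)                    // shared family effect
expand 3                                      // 3 siblings per mother
generate phys = 1 + floor(15*runiform())
generate chd  = runiform() < invlogit(-1 - 0.10*phys + u)

logistic chd phys                             // treats siblings as independent
logistic chd phys, vce(cluster m_id)          // cluster-robust standard errors
melogit  chd phys || m_id:, or                // mixed model, random intercept per mother
```

The cluster option changes only the standard errors, while the mixed model also changes the odds ratio, which becomes conditional on the family.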
Non-linear effects
Smoothers in regressions
• Polynomials
– x, x^2, x^3
• Splines
– cubic
– linear
• Fractional polynomials (2 of 8)
– x^-2, x^-1, x^-0.5, log(x), x^0.5, x, x^2, x^3
[Figure: y against x with spline knots c1 and c2; label: "estimates only plots"]
Polynomials: global
Syntax
“Non-linear effect”
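A sketch of how a restricted cubic spline could be used to check linearity on the log-odds scale. The data are simulated with an invented non-linear effect of phys; variable names follow the slides, and the spline stub cs matches the one used later in the deck.

```stata
* Simulated data (assumption: all numbers invented, effect of phys is curved)
clear
set seed 2022
set obs 2000
generate age  = 40 + floor(30*runiform())
generate educ = runiform() < 0.5
generate phys = 1 + floor(15*runiform())
generate chd  = runiform() < invlogit(-2 + 0.04*age - 0.30*phys + 0.015*phys^2)

mkspline cs = phys, cubic nknots(4)           // creates cs1 cs2 cs3
logistic chd cs1 cs2 cs3 age educ
testparm cs2 cs3                              // joint test of the non-linear terms

margins, over(phys) predict(xb)               // mean log-odds at each value of phys
marginsplot                                   // a straight line suggests a linear effect
```

Here cs1 is phys itself, so a small p-value from `testparm` points to a departure from linearity.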
INTERACTION
Effect modification
Interaction
• Interaction: combined effect of two variables
• Example
y = b0 + b1*x + b2*sex               effect of x does not depend on sex
y = b0 + b1*x + b2*sex + b3*x*sex    effect of x depends on sex (interaction)
• Test
– Interaction if b3 ≠ 0
• Scale
– Linear models: additive
– Logistic, Poisson, Cox: multiplicative
– Interaction is scale dependent
• No interaction on the additive scale implies interaction on other scales (when both variables have an effect)
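A small worked example of the scale dependence, using hypothetical risks chosen only for illustration. Let p_{00} be the risk with neither exposure, p_{10} and p_{01} the risks with one exposure each, and p_{11} the risk with both:

```latex
\begin{align*}
& p_{00} = 0.10, \quad p_{10} = 0.20, \quad p_{01} = 0.30 \\[4pt]
\text{No additive interaction:} \quad
& p_{11} = p_{00} + (p_{10} - p_{00}) + (p_{01} - p_{00}) = 0.10 + 0.10 + 0.20 = 0.40 \\[4pt]
\text{Risk ratios:} \quad
& RR_{10} = 2, \quad RR_{01} = 3, \quad RR_{11} = \frac{0.40}{0.10} = 4 \neq 2 \times 3
\end{align*}
```

The risks add up exactly, yet the combined risk ratio is smaller than the product of the single risk ratios, so there is interaction on the multiplicative scale.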
Interaction
Is the effect of physical activity on heart disease the same for low and high education?
Syntax:
logistic chd c.age c.phys##i.educ
Terms:
… c.phys i.educ        c.phys#i.educ
  main effect          interaction effect
Effect of physical activity for low and high education:
educ=0    exp( _b[phys] + 0*_b[1.educ#c.phys] )
educ=1    exp( _b[phys] + 1*_b[1.educ#c.phys] )
Syntax
“Interaction”
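A sketch of how the two education-specific odds ratios could be obtained with `lincom`. The data are simulated with an invented interaction between phys and educ; variable names follow the slides.

```stata
* Simulated data (assumption: all numbers invented, stronger effect if educ = 1)
clear
set seed 2022
set obs 2000
generate age  = 40 + floor(30*runiform())
generate educ = runiform() < 0.5
generate phys = 1 + floor(15*runiform())
generate chd  = runiform() < invlogit(-3 + 0.05*age - 0.05*phys - 0.10*phys*educ)

logistic chd c.age c.phys##i.educ
test 1.educ#c.phys                            // Wald test of the interaction term

lincom phys, or                               // OR per unit phys, educ = 0
lincom phys + 1.educ#c.phys, or               // OR per unit phys, educ = 1
display exp(_b[phys] + 1*_b[1.educ#c.phys])   // same point estimate by hand
```

`lincom` adds a confidence interval to the point estimate that the `exp()` expression gives.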
INFLUENCE
Measures of influence of outliers
Measures of influence
• Measure change in:
– Coefficients (beta)
• Delta beta
– Remove obs 1, see change
– Remove obs 2, see change
[Figure: influence (delta-beta) by observation Id]
One delta-beta per observation (with the same covariate pattern), for all covariates
Syntax
“Influence”
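A sketch of an influence check with Pregibon's delta-beta, which `predict` provides after `logistic`. The data are simulated and all numbers are invented; the variable names p and dBeta match the ones used in the summary slides.

```stata
* Simulated data (assumption: all numbers invented)
clear
set seed 2022
set obs 2000
generate age  = 40 + floor(30*runiform())
generate educ = runiform() < 0.5
generate phys = 1 + floor(15*runiform())
generate chd  = runiform() < invlogit(-3 + 0.05*age - 0.10*phys - 0.30*educ)

logistic chd phys age educ
predict p, pr                                 // predicted probability
predict dBeta, dbeta                          // delta-beta per covariate pattern
scatter dBeta p, jitter(10)                   // look for points far above the rest

gsort -dBeta
list chd phys age educ p dBeta in 1/5         // the most influential patterns
```

If a few patterns stand out, refitting the model without them shows how much the odds ratios depend on those observations.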
MARGINS
Predictions from the model
Margins
• Helpful for predicting the probability of the outcome over levels of the exposure.
• "margins" handles interactions and non-linearities
• "margins" can be followed by "marginsplot"
Predicting from the model does not make this a “prediction model”.
Our modeling strategy using DAGs makes this an estimation model.
Margins, Examples
• Model
mkspline cs=age, cubic nk(4)             3 splines: cs?
logistic chd c.phys##c.cs? i.educ sex    phys*cs?
• Margins examples
margins                                  overall risk
margins educ                             risk by educ (cat)
margins, at(sex=(0 1))                   risk by sex (cat or cont)
margins, at(phys=(3 6 9 12))             table of risks
• Conditional vs marginal
logistic y x a b c                       model
margins, dydx(x) at(a=1)                 effect of x on y, conditional on a, marginal over b and c
Margins plot
• Pr(CHD) by age for low and high phys
margins, at(phys=(1 15)) over(ageI)      integer age
marginsplot, xdim(ageI)                  x-dimension=age
Syntax
“Margins”
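A sketch of the margins examples on simulated data. The model is simpler than the one on the slide (no spline, no interaction), and all numbers in the simulation are invented; variable names follow the slides.

```stata
* Simulated data (assumption: all numbers invented)
clear
set seed 2022
set obs 2000
generate age  = 40 + floor(30*runiform())
generate educ = runiform() < 0.5
generate phys = 1 + floor(15*runiform())
generate chd  = runiform() < invlogit(-3 + 0.05*age - 0.10*phys - 0.30*educ)

logistic chd c.phys i.educ c.age
margins                                       // overall (average) predicted risk
margins educ                                  // risk by education
margins, at(phys=(3 6 9 12))                  // table of risks over phys
margins educ, at(phys=(1(2)15))               // risk over phys, by education
marginsplot                                   // one curve per education level
```

Each `margins` call averages the predicted probability over the observed values of the covariates that are not fixed in the call.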
Summing up 1
• Build model
logistic chd phys                        crude model
est store m1                             store
logistic chd phys age educ               full model
est store m2                             store
est table m1 m2, eform                   compare ORs
• Non-linearity (cubic spline)
mkspline cs=phys, cubic nk(4)            spline in phys: cs1, cs2, cs3
logistic chd cs? age educ                regression with spline
margins, over(physI) predict(xb)         predict on log-odds scale
marginsplot
Summing up 2
• Interaction
– logistic chd c.phys##i.sex             test interaction
• Influence of outliers
– predict dBeta, db                      delta beta (common)
– scatter dBeta p, jitter(10)            delta-beta by p = Pr(outcome)
• Predictions from the model
– margins educ, at(phys=(1(1)15))
– marginsplot
References
• Binder H, Sauerbrei W, Royston P. 2013. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: A simulation study with continuous response. Stat Med 32:2262-2277.
• Daniel R, Zhang J, Farewell D. 2020. Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets. Biom J.
• Govindarajulu US, Malloy EJ, Ganguli B, Spiegelman D, Eisen EA. 2009. The comparison of alternative smoothing methods for fitting non-linear exposure-response relationships with cox models in a simulation study. Int J Biostat 5.
• Kahan BC, Rushton H, Morris TP, Daniel RM. 2016. A comparison of methods to adjust for continuous covariates in the analysis of randomised trials. BMC medical research methodology 16.
• Pregibon D. 1981. Logistic regression diagnostics. The Annals of Statistics 9: 705-724.
• Robinson LD, Jewell NP. 1991. Some surprising results about covariate adjustment in logistic-regression models. Int Stat Rev 59:227-240.
EXTRA SLIDES
Generalized Linear Models, GLM
• Linear regression
• Logistic regression
• Poisson regression
[Figures: example plots, including birth weight (gram) by gestational age (days) and risk by age]