Sparse Data Bias 1h

Academic year: 2022

(1)

Sparse Data Bias 1h

Hein Stigum

Workspace “Metodelunsj” (methods lunch)

https://folkehelse.sharepoint.com/sites/1094

(2)

High impact paper

Greenland S, Mansournia MA, Altman DG. 2016. Sparse data bias: A problem hiding in plain sight. BMJ 353.

(3)

Agenda

• Explained

– 2 by 2 table

– Adjusted regression

• Diagnostic

– Penalized regression

• Remedy

– Penalized regression

Sparse Data Bias

(4)

Sparse Data Bias in 2 by 2 table

go to Excel
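A small numeric illustration of why low cell counts matter, written as a Python sketch rather than the Excel sheet the slide points to. The cell counts are hypothetical and chosen so that both tables have OR=6; the only difference is the table size. Moving a single subject between cells changes the sparse estimate drastically and the large-table estimate hardly at all.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2 by 2 table.

    a, b = exposed cases and non-cases; c, d = unexposed cases and non-cases.
    """
    return (a * d) / (b * c)


# Hypothetical tables with the same odds ratio but very different sizes.
tables = {
    "sparse": (3, 1, 6, 12),
    "large": (300, 100, 600, 1200),
}

for name, (a, b, c, d) in tables.items():
    before = odds_ratio(a, b, c, d)
    # Reclassify one exposed case as an exposed non-case.
    after = odds_ratio(a - 1, b + 1, c, d)
    print(f"{name}: OR {before:.2f} -> {after:.2f} after moving one subject")
```

In the sparse table the estimate drops from 6 to 2 when one subject changes cell, so chance variation in a single observation can dominate the result.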

(5)

Sparse Data Bias in regression

(6)

Simulated data: DAG

[Figure: DAG for the simulated data. Every arrow has OR=2. 10 medium-strong confounders give strong (positive) confounding. Prevalences shown on the nodes: 10%, 5% and 50%.]

(7)

Sparse Data Bias in regression with many covariates

go to Stata

Explained example; single simulations

(8)

Radiation and childhood leukemia

• Exposure

– Radiation from nuclear reprocessing (0/1)

• Outcome

– Childhood leukemia (0/1)

• N=24, estimated OR=57

Discard the study?

Agree on a range:

OR is unlikely to be outside (1/40, 40)

(Greenland, Mansournia et al. 2016)

(9)

Penalized regression

• Many variants:

• Lasso

– Penalizes the sum of abs(β)

– for variable selection, interaction selection

• Penalization or approximate Bayesian

– Penalize if a given β is outside a given range

– Uses external (or prior) information to improve accuracy over repeated studies

– Can be implemented by data augmentation, translating prior distributions into prior-data records, so it runs in standard software

– Stata: findit penlogit
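The penalization idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the penlogit implementation: it maximizes the penalized log-likelihood directly with Newton's method instead of using data augmentation, it handles a single binary exposure summarized as a 2 by 2 table, it assumes a normal prior on the log-OR, and the cell counts and the function name `penalized_log_or` are hypothetical.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def penalized_log_or(a, b, c, d, prior_mean, prior_var, n_iter=50):
    """Log-OR from logistic regression with a normal penalty on the exposure effect.

    a, b = exposed cases and non-cases; c, d = unexposed cases and non-cases.
    Maximizes loglik(b0, b1) - (b1 - prior_mean)**2 / (2 * prior_var).
    The intercept b0 is not penalized.
    """
    b0 = b1 = 0.0
    n1, n0 = a + b, c + d
    for _ in range(n_iter):
        p1, p0 = expit(b0 + b1), expit(b0)
        # Gradient of the penalized log-likelihood.
        g0 = (a - n1 * p1) + (c - n0 * p0)
        g1 = (a - n1 * p1) - (b1 - prior_mean) / prior_var
        # Negative Hessian: information matrix plus the penalty curvature.
        w1, w0 = n1 * p1 * (1 - p1), n0 * p0 * (1 - p0)
        h00, h01, h11 = w1 + w0, w1, w1 + 1.0 / prior_var
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 by 2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1


# Hypothetical sparse table; prior centered on OR=1 with 95% limits near 1/40 and 40.
prior_var = (math.log(40) / 1.96) ** 2
ordinary = math.log((3 * 12) / (1 * 6))
penalized = penalized_log_or(3, 1, 6, 12, prior_mean=0.0, prior_var=prior_var)
print(f"ordinary OR:  {math.exp(ordinary):.2f}")
print(f"penalized OR: {math.exp(penalized):.2f}")
```

The penalized estimate always lies between the prior mean and the ordinary estimate; the sparser the data, the closer it is pulled toward the prior.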

(10)

Penalized regression, approximate Bayes

1. Decide on a plausible range for OR

2. Run penlogit, …

go to Stata

penalized regression
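Step 1 can be made concrete with a small Python sketch that turns a plausible OR range into the mean and variance of a normal prior on the log-OR. The assumptions are mine, not the slide's: the range is read as 95% prior limits, the prior is normal and symmetric on the log scale, and the function name `normal_prior_from_or_range` is hypothetical.

```python
import math


def normal_prior_from_or_range(lower, upper, z=1.96):
    """Mean and variance of a normal prior on log(OR) with 95% limits (lower, upper)."""
    mean = (math.log(lower) + math.log(upper)) / 2
    sd = (math.log(upper) - math.log(lower)) / (2 * z)
    return mean, sd ** 2


# The two ranges used in this deck: (1/40, 40) and (1, 5).
for lower, upper in [(1 / 40, 40), (1, 5)]:
    mean, var = normal_prior_from_or_range(lower, upper)
    print(f"OR range ({lower:g}, {upper:g}): "
          f"prior mean log-OR = {mean:.3f}, prior variance = {var:.3f}")
```

A narrower range gives a smaller prior variance and therefore a stronger pull toward the prior mean.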

(11)

How bad can it get?

• N=500, True OR=2.0

• Example

– Logistic: OR=1.2

– Penalized (1,5): OR=1.8

• Is this a rare, unlikely result?

– Simulate the N=500 population many times

– Make a distribution plot of the estimated ORs
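The repeated-simulation idea can be sketched in Python as follows. This is a deliberately simplified version and not the Stata simulation behind the plots: it has no confounders, it estimates only the crude OR from the 2 by 2 table, and the assignment of a 10% exposure prevalence and a 5% baseline outcome risk is an assumption made for illustration. Simulated tables with an empty cell are dropped, which is itself a simplification.

```python
import random
import statistics


def simulate_or_estimates(n=500, true_or=2.0, p_exposed=0.10,
                          p_case_unexposed=0.05, n_sims=1000, seed=1):
    """Simulate n_sims studies of size n and return the crude OR estimates."""
    rng = random.Random(seed)
    odds_unexposed = p_case_unexposed / (1 - p_case_unexposed)
    odds_exposed = true_or * odds_unexposed
    p_case_exposed = odds_exposed / (1 + odds_exposed)
    estimates = []
    for _ in range(n_sims):
        a = b = c = d = 0  # exposed cases, exposed non-cases, unexposed cases, unexposed non-cases
        for _ in range(n):
            exposed = rng.random() < p_exposed
            risk = p_case_exposed if exposed else p_case_unexposed
            case = rng.random() < risk
            if exposed and case:
                a += 1
            elif exposed:
                b += 1
            elif case:
                c += 1
            else:
                d += 1
        if min(a, b, c, d) > 0:  # with an empty cell the OR is zero or infinite
            estimates.append((a * d) / (b * c))
    return estimates


estimates = simulate_or_estimates()
cuts = statistics.quantiles(estimates, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
print(f"simulations kept: {len(estimates)}")
print(f"median OR: {statistics.median(estimates):.2f}")
print(f"2.5th to 97.5th percentile: {cuts[0]:.2f} to {cuts[-1]:.2f}")
```

The spread between the percentiles shows how far a single N=500 study can land from the true OR of 2.0; a histogram of `estimates` gives the kind of distribution plot the slide describes.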

(12)

8/6/22 HS 12

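[Figure: distributions of the estimated odds ratios X→Y over repeated simulations, N=500. Curves: ordinary logistic and penalized (OR in 1 to 5). The values 1.2 and 1.8 are marked. X-axis: odds ratio, 0 to 10.]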

(13)

[Figure: distributions of the estimated odds ratios over repeated simulations, N=1000. Curves: ordinary logistic and penalized (OR in 1 to 5). X-axis: odds ratio, 0 to 10.]

Rule of thumb: 1000 * 5% = 50 cases, which can handle 5 covariates
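The rule of thumb is simple arithmetic, shown here as a short Python sketch. The function name `max_covariates` is hypothetical, and the 5% outcome prevalence and the count of 11 covariates (exposure plus 10 confounders) are taken from the simulation setup in this deck.

```python
def max_covariates(n_cases):
    """Rule of thumb for logistic regression: at most 10% of the number of cases."""
    return n_cases // 10


n_covariates = 11  # exposure plus 10 confounders
for n in (500, 1000, 10000):
    n_cases = round(n * 0.05)  # outcome prevalence of 5%
    limit = max_covariates(n_cases)
    verdict = "ok" if n_covariates <= limit else "too many covariates"
    print(f"N={n}: {n_cases} cases, up to {limit} covariates -> {verdict}")
```

Under this rule only the N=10000 simulation has enough cases for all 11 covariates.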

(14)

[Figure: distributions of the estimated odds ratios X→Y over repeated simulations, N=10000. Curves: ordinary logistic and penalized (OR in 1 to 5). X-axis: odds ratio, 0 to 10.]

(15)

Use Profile Likelihood CIs

penlogit Y X C1-C10, nprior(X ln(2.4) 0.2) or ppl(X)

Allows non-symmetrical CIs (on the log-OR scale)

N=500, True OR=2.0

(16)

Recommendations

(17)

Avoid Sparse Data Bias, Design

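• Do sample size calculations: $\mathrm{SE}(\log(\mathrm{OR})) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$, where a, b, c, d are the four cell counts of the 2 by 2 table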

• Match on some strong confounders (sex and age)
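A sample size check along these lines can be sketched in Python using the standard large-sample formula SE(log(OR)) = sqrt(1/a + 1/b + 1/c + 1/d), where a, b, c, d are the four cell counts of the 2 by 2 table. The expected cell counts below are hypothetical planning numbers, and the function names are my own.

```python
import math


def se_log_or(a, b, c, d):
    """Large-sample standard error of log(OR) from the four cell counts."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def ci_ratio(a, b, c, d, z=1.96):
    """Ratio of the upper to the lower 95% confidence limit for the OR."""
    return math.exp(2 * z * se_log_or(a, b, c, d))


# Hypothetical expected cell counts for a small study and one ten times larger.
for label, cells in [("small", (5, 45, 22, 428)), ("large", (50, 450, 220, 4280))]:
    print(f"{label}: SE(log OR) = {se_log_or(*cells):.2f}, "
          f"upper/lower CI limit ratio = {ci_ratio(*cells):.1f}")
```

The smallest cell dominates the sum, so the expected count in the rarest cell is the number to check at the design stage.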

(18)

Handle Sparse Data Bias, Analysis

If the number of covariates > 10% of the number of cases,

or

the 2*2 table has low cell counts,

or

the crude and adjusted ORs are very different:

try penalized regression.

If the adjusted and penalized estimates differ:

this is a sign of sparse data bias.

(19)

Extra

• Penalized regression may also work for

– Full separation

– Collinear variables

(20)

Summing up

• Sparse Data Bias

– occurs in 2*2 tables with low cell counts

– or regression models with many covariates per case

– result of “rounding off” and randomness

• Rules of thumb: number of covariates≤

– 10% of the cases (logistic)

– 10% of N (linear regression)

• Penalized regression

– diagnostic tool

– solution

(21)

References

Discacciati A, Orsini N, Greenland S. 2015. Approximate Bayesian logistic regression via penalized likelihood by data augmentation. The Stata Journal 15:712–736.

Greenland S, Mansournia MA, Altman DG. 2016. Sparse data bias: A problem hiding in plain sight. BMJ 353.

Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288.
