EWAS: from raw data to results
Jon Bohlin, Senior scientist FHI
Dept. of infection control epidemiology and modeling
Centre for Fertility and Health
AMR Centre
Course outline
• Part 1: Introduction to epigenetics and the Illumina HumanMethylation450K platform
• Part 2: Overview of methods for analysis of data from the Illumina HumanMethylation450K platform
Transformation of β
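• $\beta_i = \dfrac{\max(y_{i,\mathrm{methy}}, 0)}{\max(y_{i,\mathrm{unmethy}}, 0) + \max(y_{i,\mathrm{methy}}, 0) + \alpha}$ (performed during QC)
• $M_i = \log_2\!\left(\dfrac{\max(y_{i,\mathrm{methy}}, 0) + \alpha}{\max(y_{i,\mathrm{unmethy}}, 0) + \alpha}\right)$ (the logit transform could make analysis more robust, but the values are more difficult to interpret)
• $\beta_i = \dfrac{2^{M_i}}{2^{M_i} + 1}$; $M_i = \log_2\!\left(\dfrac{\beta_i}{1 - \beta_i}\right)$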
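The transformations above can be sketched in a few lines of Python. This is a minimal illustration, not the QC pipeline used in the course: the function names and the default offsets (100 for β, 1 for M) are assumptions chosen for the example. Note that the conversion between β and M is exact only when the offset α is ignored.

```python
import math

def beta_value(methylated, unmethylated, alpha=100.0):
    """Beta value from raw intensities; negative intensities are clipped to 0.
    The default offset alpha=100 is an assumption for this sketch."""
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + alpha)

def m_value(methylated, unmethylated, alpha=1.0):
    """M value (log2 ratio) from raw intensities; alpha=1 is an assumed default."""
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return math.log2((m + alpha) / (u + alpha))

def beta_to_m(beta):
    """Logit (base 2) of a beta value; requires 0 < beta < 1."""
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Inverse of beta_to_m."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

For example, `beta_to_m(0.5)` is 0, and values of β close to 0 or 1 are stretched out on the M scale, which is why M values are harder to read as "proportion methylated".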
Methods applied in EWAS analyses
• The data are really just a matrix of samples vs. transformed intensity values from hybridized methylation sites
• That is, Illumina HumanMethylation450K data are just an N×K matrix of values between 0 and 1
• We are interested in whether specific probes present in all or most samples exhibit association with a phenotype (i.e. trait/disease)
• Comparing the values from two probes present in multiple samples can often be done using a standard t-test
• Scanning multiple probes can be performed using standard linear regression, since between-sample probe values are often approximately normally distributed
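A probe-by-probe scan with simple linear regression could look roughly like the sketch below. It is an illustration only: the probe IDs `cg_a` and `cg_b`, the data, and the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large N. With a binary phenotype, the t statistic of the slope equals the pooled two-sample t-test statistic.

```python
import math
from statistics import NormalDist, mean

def probe_regression(x, y):
    """Simple OLS of y (methylation at one probe) on x (phenotype).
    Returns slope, standard error of the slope, and t statistic."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se, slope / se

def scan_probes(phenotype, methylation):
    """methylation maps probe ID -> list of values, one per sample."""
    results = {}
    for probe, values in methylation.items():
        slope, se, t = probe_regression(phenotype, values)
        # Normal approximation to the t distribution (assumption: large N)
        p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
        results[probe] = (slope, se, t, p)
    return results

# Hypothetical toy data: 8 samples, binary phenotype, two made-up probes
phenotype = [0, 0, 0, 0, 1, 1, 1, 1]
methylation = {
    "cg_a": [0.80, 0.82, 0.79, 0.81, 0.60, 0.62, 0.59, 0.61],
    "cg_b": [0.50, 0.52, 0.48, 0.50, 0.51, 0.49, 0.50, 0.50],
}
results = scan_probes(phenotype, methylation)
```

In this toy example `cg_a` differs clearly between the two groups while `cg_b` does not, so `cg_a` gets the smaller p-value.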
Other quantitative methods that are used in EWAS analyses
• t-test, limma
• 1-1 regression (robust, GLM, etc.)
• Shrinkage methods (LASSO/RIDGE + variants)
• Elastic net (ELNET) includes LASSO and RIDGE as special cases
• (CP)PLS/PCA regression
• Dantzig selector
• And more: this is an active research field
• More detailed «time series» analyses of each individual(?)
Setting up regression models
Outcome <- Exposure
Methylation <- Exposure
Outcome <- Methylation
* OLS models are not symmetric
* What is the outcome and what is the exposure?
* Does exposure affect methylation status? (e.g. smoking alters methylation status)
* Does methylation status affect the outcome? (e.g. ageing)
With 1000-2000 observations vs. hundreds of thousands of predictors, statistical power is at best EXTREMELY low
Many weak effects will drown in noise
Before analyses make sure to correct for:
• Sex
• (Blood) cell type (Houseman method)
• Ethnicity (often not performed, but EPIC platform includes specific SNP sites)
• Family background
• Age, if applicable
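Correcting for covariates usually means including them in the regression model alongside the exposure. The sketch below fits Methylation <- Exposure + Sex + Age by ordinary least squares using only the standard library. All names and numbers are hypothetical; cell-type proportions (e.g. from the Houseman method) would enter as further columns of the design matrix in the same way.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def ols(design, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical, noise-free toy data for one probe
exposure = [0, 1, 0, 1, 0, 1, 0, 1]
sex = [0, 0, 1, 1, 0, 0, 1, 1]
age = [30, 35, 40, 45, 50, 55, 60, 65]
y = [0.60 - 0.05 * e + 0.02 * s + 0.001 * a
     for e, s, a in zip(exposure, sex, age)]

# Design matrix columns: intercept, exposure, sex, age
design = [[1.0, e, s, a] for e, s, a in zip(exposure, sex, age)]
coef = ols(design, y)  # coef[1] is the covariate-adjusted exposure effect
```

The normal equations are used here only to keep the example dependency-free; a statistics package would normally use a numerically more stable decomposition.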
Correlates within the EWAS matrix
• Plotting different MDS/PCA components against each other can reveal biases within the dataset
• First components are often related to cell type and sex
• Age may also exhibit a strong «global» influence
• Smoking is often corrected for due to the strong effect on the methylome
• Adjusting away everything that influences the EWAS matrix may, however, not be right; it depends on the research question and the complexity of the correlation structures (DAG?)
[Figure: panels «NIEHS lowlev» and «NIEHS default»]
Other Methods for screening
• GLMNET is EXTREMELY fast and works well on a standard PC (not too great with categorical variables)
• Can fit GLMs, e.g. «Poisson», «multinomial», «binomial», even «survival» models
• Takes the whole methylation matrix as explanatory variables in one go!
• RIDGE and LASSO are special cases in a continuum of methods set by a single parameter
• GLMNET models can be trained for prediction
• Standard errors (SE) must be computed separately...
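For reference, the continuum between RIDGE and LASSO can be written out for the Gaussian (linear regression) case. The notation is chosen for this note: $w$ denotes the regression coefficients, to avoid a clash with the methylation β values, and the mixing parameter $\alpha$ here is unrelated to the offset α in the β transformation.

```latex
\hat{w} = \arg\min_{w_0,\, w}\;
  \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w_0 - x_i^{\top} w \right)^2
  + \lambda \left[ \frac{1 - \alpha}{2} \lVert w \rVert_2^2
  + \alpha \lVert w \rVert_1 \right],
\qquad 0 \le \alpha \le 1
```

Setting $\alpha = 1$ gives LASSO, $\alpha = 0$ gives RIDGE, and values in between give the elastic net; $\lambda \ge 0$ controls the overall amount of shrinkage and is typically chosen by cross-validation.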
P-values
• QQ plots: often used in GWAS, but may not necessarily be right for EWAS…
• CpGs may be more biased than SNPs
• Bonferroni correction is often used because of its robustness
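As a quick worked example of the Bonferroni correction: the significance level is divided by the number of tests, so each probe must pass a much stricter threshold. The sketch below uses 450,000 tests as a round figure for the 450K array (an assumption; the actual number depends on how many probes survive QC), and the function names are made up for this example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Keep probes whose p-value is below alpha / (number of tests).
    p_values maps probe ID -> p-value."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {probe: p for probe, p in p_values.items() if p < threshold}

# Round figure for the 450K array (assumption for illustration)
threshold_450k = bonferroni_threshold(0.05, 450_000)
```

With these numbers the per-probe threshold is roughly 1.1e-7, which shows why weak effects are hard to detect with 1000-2000 samples.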
More refined analyses
• Robust regression, with a high breakdown point to reduce the influence of outliers
• Amplify effects of multiple outcomes using PCA (rather not MANOVA)
• Handle «correlated» methylation points
• «Bump hunting»
• «Epistasis» (?)
Exposed to passive smoking (cg05575921, AHRR, 5p15.3)
• 1: no
• 2: sometimes
• 3: daily
Father smokes (cg05575921, AHRR)
• 1: no
• 2: yes
SMOKING
• Very strong effect from mother on child…
• The effect from the father could be due to the mother being subjected to «passive» smoking, as the same factors are found significant, but weaker, when the mother does not smoke
• The number of cigarettes per day has an influence (above vs. below the median)
• Mother's mother: no effect found
Gestational age prediction: presentation of EWAS results
Associations between methylation and gestational age?
Adjustment for cell type (Houseman) and sex
…added 20 PCs
Gestational age top (1068), age in years bottom (95)
Prediction model trained with MoBa 1, trying to predict GA in MoBa 2 using PLS regression:
Coefficients:
              Value      Std. Error  t value
(Intercept)  -28.51931   9.53812     -2.99
tst            1.10042   0.03408     32.29
Training set: 800 obs. from MoBa 1; prediction: 685 obs. from MoBa 2
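One common way to summarize how well such a predictor transfers to a new cohort is to regress the observed values on the predicted ones and report the intercept, the slope and R². The slides do not state exactly how the coefficient table or the R² values were computed, so the sketch below is only one plausible reading; the function names are hypothetical.

```python
from statistics import mean

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = mean(observed), mean(predicted)
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return sop ** 2 / (soo * spp)

def calibration_line(observed, predicted):
    """Intercept and slope of the regression observed ~ predicted.
    A slope near 1 and an intercept near 0 indicate good calibration."""
    mo, mp = mean(observed), mean(predicted)
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    spp = sum((p - mp) ** 2 for p in predicted)
    slope = sop / spp
    return mo - slope * mp, slope
```

A high R² with a slope far from 1 would mean the predictor ranks samples well but is miscalibrated on the new cohort, so both quantities are worth reporting.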
Prediction of children's age (10-17 yrs) based on a model trained on gestational age data (800 MoBa samples)
R² = 0.46, best model (p < 0.05)
Prediction of adults' age based on a model trained on the children's age data
R² = 0.67, best model
Prediction of adults' age (18-65 yrs) based on a model trained on gestational age data (800 MoBa samples)
R² = 0.16, p < 0.05
Prediction of gestational age based on a model trained on children's age data
R² = 0.07, p < 0.05
Prediction of gestational age based on a model trained on adults' age data
R² = 0.01, best model
Summary of age predictions
• Children's age (10-17 years) is best predicted using the age predictor based on adults (18-65 years), and vice versa
• Gestational age is best predicted by the predictor for children's age (but this time *NOT* vice versa!). The predictions, however, turn out to be poor
• The gestational age predictor predicts age in adults poorly (and vice versa not at all)
• Although the training sets for the children's and adults' age predictors are small, these models are better at predicting each other than the gestational age predictor, which was trained with more than 1000 samples
Results/impressions from experience…
• Physical measurements seem to be significantly correlated with methylation data
• Data from questionnaires are difficult for a number of reasons (missing values, «subjective answers»)
• «Nutrition data» gave no correlations whatsoever, or correlations too weak to discover with the current dataset
• Associations with behavioral/cognitive data are difficult to detect…
• Several results have been established...