EWAS –from raw data to results

Academic year: 2022

(1)

EWAS –from raw data to results

Jon Bohlin, Senior scientist FHI

Dept of infection control epidemiology and modeling Centre for Fertility and Health

AMR Centre

(2)

Course outline

• Part 1: Introduction to epigenetics and the Illumina HumanMethylation450k platform

• Part 2: Overview of methods for analysis of data from Illumina HumanMethylation450k

(3)

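Transformation of $\beta_i$

• $\beta_i = \dfrac{\max(y_{i,\mathrm{methy}}, 0)}{\max(y_{i,\mathrm{unmethy}}, 0) + \max(y_{i,\mathrm{methy}}, 0) + \alpha}$ (performed during QC)

• $M_i = \log_2\!\left(\dfrac{\max(y_{i,\mathrm{methy}}, 0) + \alpha}{\max(y_{i,\mathrm{unmethy}}, 0) + \alpha}\right)$ (the logit transform can make the analysis more robust, but the values are more difficult to interpret)

• $\beta_i = \dfrac{2^{M_i}}{2^{M_i} + 1}$; $M_i = \log_2\!\left(\dfrac{\beta_i}{1 - \beta_i}\right)$ (ignoring the offset $\alpha$)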
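The transformations above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the course: the function names are hypothetical, and the default offsets (100 for β, 1 for M) are commonly quoted choices assumed here, not values taken from the slides.

```python
import math

def beta_value(meth, unmeth, alpha=100.0):
    """Beta value from methylated/unmethylated intensities (alpha default is an assumption)."""
    m = max(meth, 0.0)
    u = max(unmeth, 0.0)
    return m / (m + u + alpha)

def m_value(meth, unmeth, alpha=1.0):
    """M value: log2 ratio of offset intensities (alpha default is an assumption)."""
    m = max(meth, 0.0) + alpha
    u = max(unmeth, 0.0) + alpha
    return math.log2(m / u)

def beta_to_m(beta):
    """Logit (base 2) of a beta value; requires 0 < beta < 1."""
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Inverse of beta_to_m."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

The two conversion functions are exact inverses of each other; they agree with `beta_value` and `m_value` only when the offsets are zero or negligible relative to the intensities.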

(4)

Methods applied in EWAS analyses

The data is really just a matrix of samples vs. transformed intensity values from hybridized methylation sites

That is, Illumina HumanMethylation450k data is just an N×K matrix of values between 0 and 1

We are interested in whether specific probes present in all or most samples exhibit association with a phenotype (i.e. trait/disease)

Comparing the values from two probes present in multiple samples can often be done using a standard t-test

Scanning multiple probes can be performed using standard linear regression, since between-sample probe values are often normally distributed
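As a rough sketch of such a scan, the code below fits one simple linear regression per probe (methylation as outcome, phenotype as predictor) and returns a slope, test statistic and p-value for each. The function name and data layout are hypothetical, covariates are left out for brevity, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large N.

```python
import math
from statistics import NormalDist, mean

def probe_scan(methylation, phenotype):
    """methylation: N rows (samples), each a list of K probe values.
    phenotype: N numeric values. Returns one (slope, t, p) tuple per probe."""
    n = len(phenotype)
    x_bar = mean(phenotype)
    sxx = sum((x - x_bar) ** 2 for x in phenotype)
    results = []
    for k in range(len(methylation[0])):
        y = [row[k] for row in methylation]
        y_bar = mean(y)
        sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(phenotype, y))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar
        # Residual sum of squares; a perfect fit (rss == 0) would need special handling
        rss = sum((v - intercept - slope * x) ** 2 for x, v in zip(phenotype, y))
        se = math.sqrt(rss / (n - 2) / sxx)
        t = slope / se
        # Two-sided p-value, normal approximation (assumption: large N)
        p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
        results.append((slope, t, p))
    return results
```

With a binary phenotype coded 0/1, the slope is simply the difference in mean methylation between the two groups, which ties the regression back to the t-test mentioned above.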

(5)

Other quantitative methods that are used in EWAS analyses

• t-test, limma

• 1-1 regression (robust, GLM, etc.)

• Shrinkage methods (LASSO/RIDGE+variants)

• LASSO and RIDGE are special cases of the elastic net (ELNET)

• (CP)PLS/PCA regression

• Dantzig selector

• ++ active research field

• More detailed «time series» analyses of each individual(?)

(6)

Setting up regression models

Outcome <- Exposure
Methylation <- Exposure
Outcome <- Methylation

* OLS models are not symmetric

* What is outcome and what is exposure?

* Exposure affects methylation status? (i.e. smoking alters methylation status)

* Methylation status affects outcome? (Ageing)

1000-2000 observations vs. hundreds of thousands of predictors implies that statistical power is, at best, EXTREMELY low

Many weak effects will drown in noise
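The asymmetry of OLS can be made concrete with the two simple-regression slopes. The symbols below are introduced here for illustration and do not appear on the slides: $r$ is the sample correlation between $x$ and $y$, and $s_x$, $s_y$ are their sample standard deviations.

```latex
\hat{b}_{y \mid x} = r\,\frac{s_y}{s_x}, \qquad
\hat{b}_{x \mid y} = r\,\frac{s_x}{s_y}, \qquad
\hat{b}_{y \mid x}\,\hat{b}_{x \mid y} = r^2 \le 1
```

Regressing $x$ on $y$ therefore does not give the reciprocal of the slope from regressing $y$ on $x$ unless $|r| = 1$, so the choice of outcome and exposure changes the estimated effect size.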

(7)

Before analyses make sure to correct for:

• Sex

• (Blood) cell type (Houseman method)

• Ethnicity (often not performed, but EPIC platform includes specific SNP sites)

• Family background

• Age, if applicable

(8)

Correlates within the EWAS matrix

• Plotting of different MDA/PCA components against each other can reveal biases within the dataset

• First components are often related to cell type and sex

• Age may also exhibit a strong «global» influence

• Smoking is often corrected for due to the strong effect on the methylome

• Adjusting away everything that influences the EWAS matrix may however not be right. Depends on the research question and complexity of correlation structures (DAG?)
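One way to check whether a leading component tracks a known covariate is sketched below: the first principal component is extracted by power iteration and its sample scores are correlated with, for example, sex. This is a toy illustration under stated assumptions; the function names are hypothetical, and real analyses would use an SVD routine from a numerical library.

```python
import math

def first_pc_scores(matrix, n_iter=200):
    """Scores of each sample (row) on the first principal component.
    Uses power iteration; assumes the uniform start vector is not
    orthogonal to the leading component."""
    n, k = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in matrix]
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(n_iter):
        scores = [sum(c * w for c, w in zip(row, v)) for row in centered]
        # Multiply by X^T X and renormalise
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(c * w for c, w in zip(row, v)) for row in centered]

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

A large absolute correlation between the scores and a covariate suggests that the component reflects that covariate; the sign of the scores is arbitrary.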

(9)

[Figure: two panels, "NIEHS lowlev" and "NIEHS default"]

(10)

Other Methods for screening

• GLMNET is EXTREMELY fast and works well on a standard PC (not too great with categorical variables)

• Can run GLMs, i.e. «Poisson», «multinomial», «binomial», even «survival» models

• Takes the whole methylation matrix as explanatory variables in one go!

• RIDGE and LASSO are special cases in a continuum of methods set with a parameter

• GLMNET methods can be trained for prediction

• SE must be computed...
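The continuum between RIDGE and LASSO can be illustrated with a small coordinate-descent sketch of the elastic net for a Gaussian outcome. It is a toy version written for this text, not the GLMNET implementation: the function names are hypothetical, `lam` is the overall penalty strength, `mix` is the mixing parameter (1 gives LASSO, 0 gives RIDGE), and both the outcome and the columns of `X` are assumed to be centred so that no intercept is needed.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, returning 0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam, mix, n_sweeps=100):
    """Minimise (1/2n)*RSS + lam*((1-mix)/2*sum(b^2) + mix*sum(|b|))
    by cyclic coordinate descent. Assumes centred y and centred columns."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residuals y - X*beta
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of column j with the partial residual
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            denom = sum(c * c for c in col) / n + lam * (1.0 - mix)
            new = soft_threshold(rho, lam * mix) / denom
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta
```

With `mix=1` coefficients are set exactly to zero once the penalty exceeds their covariance with the outcome, which is what makes LASSO-type models usable for screening; with `mix=0` coefficients are shrunk but never removed.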

(11)

P-values

QQ plots: often used in GWAS, but may not necessarily be right for EWAS…

CpGs may be biased more so than SNPs

Bonferroni often used because of robustness
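A minimal sketch of both ideas follows: the Bonferroni threshold for a given number of tests, and the expected versus observed $-\log_{10}(p)$ pairs that make up a QQ plot. The function names are hypothetical, and the expected quantiles assume independent, uniformly distributed p-values under the null, which, as noted above, may not hold for CpGs.

```python
import math

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted ascending.
    Assumes all p-values are strictly positive."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected quantiles of n uniform p-values: (i + 0.5) / n
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))
```

For example, with 450,000 tests (a round number assumed here for illustration) `bonferroni_threshold(450000)` gives roughly 1.1e-7. Points well above the diagonal across the whole range of a QQ plot, rather than only in the tail, point towards inflation instead of true signal.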

(12)
(13)

More refined analyses

• Robust regression, with a high breakdown point to reduce influence of outliers

• Amplify effects of multiple outcomes using PCA (rather not MANOVA)

• Handle «correlated» methylation points

• «Bump hunting»

• «Epi-stasis» (?)
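To illustrate what robustness to outliers means in practice, the sketch below uses the Theil-Sen estimator, chosen here only because it is short to write down; the slides do not name a specific robust method. Its breakdown point is about 29%, so estimators such as least trimmed squares or MM-estimation would be needed for a truly high breakdown point. The function name is hypothetical and a continuous exposure is assumed.

```python
from statistics import median

def theil_sen(x, y):
    """Robust simple regression: median of all pairwise slopes,
    then median of the implied intercepts."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept
```

A single extreme methylation value can pull an OLS slope substantially, whereas the median of pairwise slopes is unaffected as long as most pairs are clean.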

(14)

Exposed to passive smoking

cg05575921/AHRR/5p15.3

• 1 no

• 2 sometimes

• 3 daily

(15)

Father smokes

• 1 – no

• 2 – yes

• cg05575921/AHRR

(16)

SMOKING

• Very strong effect from mother on child…

• Effect from father could be due to the mother being subjected to «passive» smoking, as the same factors are found significant, but weaker, when the mother does not smoke

• The number of cigarettes per day has an influence (above/below the median)

• Mother’s mother, no effect found

(17)

Gestational age prediction: presentation of EWAS results

(18)

Associations between methylation and gestational age?

(19)

Adjustment for cell type (Houseman) and sex

(20)

…added 20 PCs

(21)

Gestational age top (1068), age in years bottom (95)

(22)

Prediction model trained with MoBa 1, used to predict GA in MoBa 2 with PLS regression:

Coefficients:
              Value       Std. Error   t value
(Intercept)   -28.51931   9.53812      -2.99
tst             1.10042   0.03408      32.29

Training set: 800 observations from MoBa 1; prediction: 685 observations from MoBa 2

(23)

Prediction of children's age (10-17 yrs) based on a model trained on gestational age data (800 MoBa samples)

R² = 0.46, best model (p < 0.05)

(24)

Prediction of adults' age based on a model trained on the children's age data

R² = 0.67, best model

(25)

Prediction of adults' age (18-65 yrs) based on a model trained on gestational age data (800 MoBa samples)

R² = 0.16, p < 0.05

(26)

Prediction of gestational age based on a model trained on children's age data

R² = 0.07, p < 0.05

(27)

Prediction of gestational age based on a model trained on adults' age data

R² = 0.01, best model

(28)

Summary of age predictions

Children's age (i.e. 10 < years < 17) is best predicted using the age predictor based on adults (18 < years < 65), and vice versa

Gestational age is best predicted by the predictor for children's age (but this time *NOT* vice versa!). The predictions nevertheless turn out to be poor

The gestational age predictor predicts age in adults poorly (and vice versa not at all)

Even though the training sets for the children and adult age predictors are small, these models are better at predicting each other (as opposed to the gestational age predictor, which was trained with more than 1000 samples)

(29)

Results/impressions from experience…

• Physical measurements seem to be significantly correlated with methylation data

• Data from questionnaires are difficult to use for a number of reasons (missing values, «subjective» answers)

• «Nutrition-data» gave no correlations whatsoever, alternatively too weak to discover with the current dataset

• Behavioral/cognitive data difficult to detect…

• Several results have been established...

(30)

Even more refined analyses…
