EWAS – from raw data to results

(1)

Jon Bohlin, Senior scientist FHI

Dept of infection control epidemiology and modeling

Centre for Fertility and Health

AMR Centre

(2)

Course outline

• Part 1: Introduction to epigenetics and the Illumina HumanMethylation450k platform

• Part 2: Overview of methods for analysis of data from the Illumina HumanMethylation450k platform

(3)

Methylation analyses

• Epigenetics has become popular because it provides insight into how the environment may influence gene regulation.

• One epigenetic mechanism is DNA methylation. There are several types of methylation, but here we focus on only one: 5-methylcytosine.

• Somewhat simplified, we might say that methylation reduces gene expression

• Genomic methylation patterns change throughout the life course

• The technology is new but evolves quickly

(4)

• Cytosine

• 5-methylcytosine

(5)

(6)

Illumina 450k nomenclature

• One observation (1 sample) = one array (450k methylation sites)

• 12 observations on 1 slide

• 1 plate holds at most 8 slides (96 arrays)

(7)

Methylation platforms

• Illumina and NimbleGen are based on «microarray» technology

• Oligomers (as in GWAS arrays) and their bisulfite-converted counterparts are attached to respective beads

• Hybridisation with a methylated base = green light

• No hybridisation gives red light

• Intensities vary for both lights

• Intensities are converted to M (methylated) and U (unmethylated), and subsequently to «beta», 0 <= beta <= 1

• beta is the ratio of the methylated intensity to the total intensity (methylated + unmethylated)
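The conversion from intensities to beta can be sketched as follows. This is a minimal illustration, assuming the commonly used form beta = M / (M + U + offset); the offset of 100 and the intensity values are assumptions for illustration and do not come from the slides.

```python
import numpy as np

def beta_values(meth, unmeth, offset=100.0):
    """Return beta = M / (M + U + offset) for arrays of intensities.

    The offset (assumed here to be 100) stabilises beta when both
    intensities are close to zero.
    """
    meth = np.asarray(meth, dtype=float)
    unmeth = np.asarray(unmeth, dtype=float)
    return meth / (meth + unmeth + offset)

# Hypothetical intensities for three CpG sites in one sample
M = np.array([9000.0, 500.0, 4000.0])
U = np.array([1000.0, 9500.0, 4000.0])
beta = beta_values(M, U)
print(beta.round(3))
```

With non-negative intensities, beta always falls between 0 (unmethylated) and 1 (methylated).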

(8)

Type I and II probes

• Type I probes: stronger signal due to distinct probes for methylated/unmethylated sites. Good in low-density CpG regions since they assume the same methylation status for adjacent CpG sites. One color.

• Type II probes: weaker signal and only one probe for methylated/unmethylated. Better in high-density CpG regions as they can contain up to 3 underlying CpGs. Two colors.

(9)

Illumina beadchip dataset

(10)

Workflow –from start to finish

• Quality control

- Removal of bad samples

- Removal of bad probes

- Removal of SNP-based probes

- Removal of inserted control probes

- Removal of samples with gender issues

• Normalization

- Correct for technical bias (e.g. between plates / batches)

- Correct for technology-specific features (type I/type II bias)

• Analysis

- Identify DMRs (differentially methylated regions)

- Correct for known biases in data

- Correct for cell type (cord blood)

(11)

Quality control (QC) –probe selection

• Remove probes with many missing values (e.g. >10%)

• Remove observations with bad probes (detection p-value > 0.01)

• Remove «gender outliers» and duplicates

• Probes near SNPs are sometimes removed (a SNP can influence methylation status when close, <10 bp)

• X/Y chromosomes are also often removed to avoid bias
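The probe and sample filters above can be sketched on a matrix of detection p-values. This is a minimal sketch on simulated data: the function name `qc_filter` is hypothetical, and applying the same 10% cutoff to samples as to probes is an assumption, not something stated on the slide.

```python
import numpy as np

def qc_filter(detection_p, p_cutoff=0.01, max_failed_frac=0.10):
    """Return boolean masks (keep_probes, keep_samples).

    detection_p: array of shape (n_probes, n_samples) of detection p-values.
    A measurement counts as failed when its p-value exceeds p_cutoff.
    """
    failed = np.asarray(detection_p) > p_cutoff
    # Drop probes that fail in more than max_failed_frac of the samples
    keep_probes = failed.mean(axis=1) <= max_failed_frac
    # Judge samples only on the probes that survived the probe filter
    keep_samples = failed[keep_probes].mean(axis=0) <= max_failed_frac
    return keep_probes, keep_samples

rng = np.random.default_rng(0)
detp = rng.uniform(0, 0.005, size=(1000, 20))  # simulated: all measurements pass
detp[:50, :] = 0.5   # 50 probes fail in every sample
detp[:, 0] = 0.5     # sample 0 fails on every probe
keep_probes, keep_samples = qc_filter(detp)
print(keep_probes.sum(), keep_samples.sum())
```

Filtering probes first and then samples is one possible order; the reverse order can give slightly different results when a few bad samples account for most failed measurements.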

(12)

MDS plot to evaluate gender outliers

8 outliers

(13)

(14)

«Hidden» QC issues

• Methylated CpGs close to SNPs

• Non-uniquely mapped probes

• Probe design types (type 1, type 2)

• Bias introduced from different plates/batches

(15)

Normalization

• Examine possible problems relating to «plates» (between- and within-array corrections)

• Batch-effects when combining multiple analyses (consortium/meta-analyses)

• Correct for type 1 and type 2 probes

(16)

Without QC/normalization, all chromosomes (left); with QC/normalization (right). Two different datasets

(17)

(18)

Betas by Plate

Plate #   N run   N passed QC   % passed
1         96      92            96%
2         96      69            72%
3         96      80            83%
4         96      87            91%
5         96      67            70%
6         96      83            86%
7         96      90            94%
8         96      88            92%
9         96      69            72%
Total     864     725
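The pass rates in the plate table can be recomputed from the counts, which also makes it easy to flag plates that deserve a closer look. This is a small sketch using the numbers from the table; the 80% flagging threshold is a hypothetical choice, not one given in the slides.

```python
# Number of arrays passing QC per plate, taken from the table (96 run per plate)
passed_by_plate = {1: 92, 2: 69, 3: 80, 4: 87, 5: 67, 6: 83, 7: 90, 8: 88, 9: 69}
n_run = 96

rates = {plate: passed / n_run for plate, passed in passed_by_plate.items()}
total_passed = sum(passed_by_plate.values())
total_run = n_run * len(passed_by_plate)

# Hypothetical rule: flag plates where fewer than 80% of arrays passed QC
flagged = [plate for plate, rate in rates.items() if rate < 0.80]

for plate, rate in rates.items():
    print(plate, f"{rate:.0%}")
print("total", total_passed, total_run, "flagged", flagged)
```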

(19)

Betas by Batch

Image Day   Batch   N run   N passed QC   % passed
1/14/2013   1       96      92            96%
1/18/2013   2       60      42            70%
1/19/2013   3       84      72            86%
1/20/2013   4       36      25            69%
1/21/2013   5       36      30            83%
1/22/2013   6       12      10            83%
1/25/2013   7       60      59            98%
1/26/2013   8       132     112           85%
1/27/2013   9       144     126           88%
1/28/2013   10      96      88            92%
2/4/2013    NA      12      0             0%
2/23/2013   11      48      38            79%
2/24/2013   12      48      31            65%

(20)

Dataset (batch) correction

• Necessary when combining 2 or more datasets

• Points colored by dataset (batch), 2 pictures

• PCA of dataset 1 and dataset 2 before «ComBat», all chromosomes
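To illustrate what a batch correction does, the sketch below removes a per-batch mean shift probe by probe. This is deliberately not ComBat: it is a plain location adjustment with no empirical Bayes shrinkage, no variance adjustment and no protection of covariates. The data are simulated, and the function name `center_by_batch` is hypothetical.

```python
import numpy as np

def center_by_batch(values, batch):
    """Remove per-batch mean shifts, probe by probe.

    values: (n_probes, n_samples); batch: length n_samples batch labels.
    Each batch is shifted so that its probe-wise mean equals the
    overall probe-wise mean.
    """
    values = np.asarray(values, dtype=float)
    batch = np.asarray(batch)
    grand_mean = values.mean(axis=1, keepdims=True)
    corrected = values.copy()
    for b in np.unique(batch):
        cols = batch == b
        batch_mean = values[:, cols].mean(axis=1, keepdims=True)
        corrected[:, cols] += grand_mean - batch_mean
    return corrected

rng = np.random.default_rng(1)
batch = np.array([0] * 10 + [1] * 10)
data = rng.normal(0.0, 1.0, size=(200, 20))
data[:, batch == 1] += 2.0  # simulated additive batch shift
fixed = center_by_batch(data, batch)

gap_before = abs(data[:, batch == 0].mean() - data[:, batch == 1].mean())
gap_after = abs(fixed[:, batch == 0].mean() - fixed[:, batch == 1].mean())
print(gap_before, gap_after)
```

If the exposure of interest is unevenly distributed across batches, a correction like this also removes part of the real signal, so the design should be checked before correcting.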

(21)

(22)

(23)

Preprocessing and QC

• QC, «manual» using the dataset

• Gender correction can be performed using PCA (or MDS) and plotting (standard functions in R)

• PCA (or MDS) can also be used to assess the need for within- and between-array normalization, not forgetting batch correction
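The PCA-based gender check can be sketched as follows. The slides refer to standard functions in R; this sketch uses Python with NumPy instead, on simulated X-chromosome betas. The methylation levels, the two mislabeled samples and the rule of splitting the first principal component at zero are all assumptions for illustration, and the split works here because the two groups are of similar size.

```python
import numpy as np

def first_pc(values):
    """Scores on the first principal component; values is (n_samples, n_probes)."""
    centered = values - values.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

rng = np.random.default_rng(2)
n_probes = 300
true_sex = np.array(["F"] * 15 + ["M"] * 15)
# Simulated X-chromosome betas: higher on average in females than in males
level = np.where(true_sex == "F", 0.5, 0.2)[:, None]
betas = np.clip(level + rng.normal(0, 0.05, size=(30, n_probes)), 0, 1)

recorded_sex = true_sex.copy()
recorded_sex[[3, 20]] = ["M", "F"]  # two simulated label errors

pc1 = first_pc(betas)
side = pc1 > 0  # the two clusters fall on opposite sides of zero

# Name each side after the recorded sex held by most of its samples
majority = {}
for s in (True, False):
    labels, counts = np.unique(recorded_sex[side == s], return_counts=True)
    majority[s] = labels[counts.argmax()]

predicted = np.array([majority[bool(s)] for s in side])
outliers = np.flatnonzero(predicted != recorded_sex)
print(outliers)
```

In practice the same scores would be plotted and inspected by eye, as in the MDS plot, rather than split by a fixed rule.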

(24)

Normalization and Preprocessing

• Removal of systematic biases is difficult in EWAS studies. There is also a danger of introducing new ones

• Probe bi-modality can be challenging to work with in a statistical setting

• Normalization procedures tend to depend on the dataset. Sometimes normalization is not needed at all (this applies only to within- and between-array normalization).

• Type-correction normalization must always be performed; BMIQ and RCP seem to be the preferred methods now. Careful with SWAN…
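As an illustration of what a between-array normalization does, the sketch below implements quantile normalization, which forces all arrays to share the same empirical distribution. The slides do not name this method, so it is used here only as a generic example; it is not a type-correction method such as BMIQ or RCP. The data are simulated.

```python
import numpy as np

def quantile_normalize(values):
    """Force every column (array) to share the same empirical distribution.

    values: (n_probes, n_samples). Ties are broken by order of appearance.
    """
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(values, order, axis=0)
    reference = sorted_vals.mean(axis=1)  # mean of the k-th smallest values
    normalized = np.empty_like(values)
    # Put the k-th reference value at the position of each column's k-th smallest
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

rng = np.random.default_rng(3)
scale = np.array([1.0, 1.5, 0.8, 2.0])
shift = np.array([0.0, 0.5, -0.3, 1.0])
raw = rng.normal(size=(500, 4)) * scale + shift  # four arrays with different scales
qn = quantile_normalize(raw)
print(raw.mean(axis=0).round(2), qn.mean(axis=0).round(2))
```

Because every array ends up with an identical distribution, genuine global differences between samples are erased as well, which is one way a normalization step can introduce new bias.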

(25)

Normalization and Preprocessing

• Normalization, i.e. correction for technical bias (both within- and between-array), can be performed using the minfi, wateRmelon and methylumi (Bioconductor) packages

• ComBat is a good method for performing between-batch correction. Both parametric and non-parametric empirical Bayes methods are available. Beware of introduced bias, however

• RnBeads is also strongly recommended

(26)

Papers that will get you going with pre-processing and QC

Comprehensive analysis of DNA methylation data with RnBeads: Assenov Y, Müller F, Lutsik P, Walter J, Lengauer T, Bock C.

Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi: Fortin JP, Triche TJ Jr, Hansen KD.

Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays: Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA.

A data-driven approach to preprocessing Illumina 450K methylation array data: Pidsley R, Y Wong CC, Volta M, Lunnon K, Mill J, Schalkwyk LC. (wateRmelon package)

A framework for analyzing DNA methylation data from Illumina Infinium HumanMethylation450 BeadChip: Wang Z, Wu X, Wang Y.

A systematic assessment of normalization approaches for the Infinium 450K methylation platform: Wu MC, Joubert BR, Kuan PF, Håberg SE, Nystad W, Peddada SD, London SJ.

quantro: a data-driven approach to guide the choice of an appropriate normalization method: Hicks SC, Irizarry RA.

(27)

Outline

• Preprocessing (QC, normalisation)

• Transformation (beta-value versus M-value)

• Analysis:

- 1-1 regression (GLM, etc.)

- Shrinkage methods (LASSO/ridge + variants)

- Dantzig selector

- More detailed «time series» analyses of each individual(?)

- Region-based analysis of candidate genes and identified regions (DMR): bump hunting

- Pathways/GO
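Two items from this outline, the beta-to-M-value transformation and the 1-1 regression, can be sketched together. The sketch assumes the usual logit form M = log2(beta / (1 - beta)) and fits an ordinary least-squares line per CpG. The simulated data, the binary exposure, the effect size and the function names are hypothetical.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """M-value = log2(beta / (1 - beta)); eps keeps betas of 0 or 1 finite."""
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)
    return np.log2(beta / (1 - beta))

def per_cpg_slopes(m_values, exposure):
    """Least-squares slope and t-statistic of each CpG on one exposure.

    m_values: (n_samples, n_cpgs); exposure: (n_samples,).
    Fits y = a + b * exposure separately for every CpG column.
    """
    y = np.asarray(m_values, dtype=float)
    x = np.asarray(exposure, dtype=float)
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean(axis=0)
    sxx = (xc ** 2).sum()
    slope = xc @ yc / sxx
    resid = yc - np.outer(xc, slope)
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)  # residual variance per CpG
    t_stat = slope / np.sqrt(sigma2 / sxx)
    return slope, t_stat

rng = np.random.default_rng(4)
exposure = rng.integers(0, 2, size=100).astype(float)  # hypothetical binary exposure
beta = rng.beta(5, 5, size=(100, 50))
m = beta_to_m(beta)
m[:, 0] += 2.0 * exposure  # simulated effect at the first CpG only

slope, t_stat = per_cpg_slopes(m, exposure)
top = int(np.argmax(np.abs(t_stat)))
print(top, round(float(slope[top]), 2))
```

M-values are often preferred for regression because their variance is more stable across the methylation range, while beta-values are easier to interpret. A real analysis would add covariates and correct for testing hundreds of thousands of CpGs.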