EWAS: from raw data to results
Jon Bohlin, Senior scientist FHI
Dept of infection control epidemiology and modeling
Centre for Fertility and Health
AMR Centre
Course outline
• Part 1: Introduction to epigenetics and the Illumina HumanMethylation450 (450k) platform
• Part 2: Overview of methods for analysis of data from the Illumina HumanMethylation450 (450k) platform
Methylation analyses
• Epigenetics has become popular since it provides insight into how the environment may influence gene regulation.
• One epigenetic mechanism is DNA methylation. There are several different types of methylation, but here we focus on only one (5-methylcytosine)
• Somewhat simplified, we might say that methylation reduces gene expression
• Genomic methylation patterns change throughout the life course
• The technology is new but evolves quickly
[Figure: cytosine and 5-methylcytosine]
Illumina 450k nomenclature
• One observation (1 sample)=one array (450k methylation sites)
• 12 observations on 1 slide
• 1 plate max 8 slides (96 arrays)
Methylation platforms
• Illumina and NimbleGen are based on «microarray» technology
• Oligomers (like GWAS) and bisulfite-converted counterparts are attached to their respective beads
• Hybridisation with a methylated base=green light
• No hybridisation gives red light
• Intensities vary for both lights
• Intensities are converted to M (methylated) and U (unmethylated), and subsequently to «beta», 0<=beta<=1
• beta is the ratio of the methylated intensity to the total intensity, beta = M/(M+U): 0 means unmethylated, 1 means fully methylated
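The conversion from intensities to beta can be sketched in a few lines of Python. This is a minimal sketch, not the Illumina or Bioconductor implementation: the function name `beta_value`, the probe names and the intensities are made up for illustration. The optional `offset` follows the common convention of adding a small constant (often 100) to the denominator to stabilise beta when both intensities are low.

```python
def beta_value(methylated, unmethylated, offset=0.0):
    """Return beta = M / (M + U + offset) for a single probe.

    offset=0 gives the plain ratio; a positive offset (100 is a common
    choice) keeps beta stable when both intensities are close to zero.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("intensities must be non-negative")
    total = methylated + unmethylated + offset
    if total == 0:
        raise ValueError("total intensity is zero; beta is undefined")
    return methylated / total


# Hypothetical (M, U) intensities for three probes.
probes = {
    "probe_a": (9000.0, 1000.0),   # mostly methylated
    "probe_b": (500.0, 9500.0),    # mostly unmethylated
    "probe_c": (4000.0, 4000.0),   # half methylated
}

betas = {name: beta_value(m, u) for name, (m, u) in probes.items()}
for name, beta in betas.items():
    print(name, round(beta, 3))
```

Because M and U are both non-negative, beta always falls between 0 and 1, which is the range stated on the slide.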
Type I and II probes
• Type I probes: stronger signal due to distinct probes for methylated/unmethylated sites. Good in low density CpG regions since it assumes same methylation status for adjacent CpG sites. One color.
• Type II probes: weaker signal and only one probe for methylated/unmethylated. Better in high density CpG regions as it can contain up to 3 underlying CpGs. Two colors.
Illumina beadchip dataset
Workflow: from start to finish
• Quality control
- Removal of bad samples
- Removal of bad probes
- Removal of SNP based probes
- Removal of inserted control probes
- Removal of gender issues
• Normalization
- Correct for technical bias (e.g. between plates / batches)
- Correct for technology-specific features (type I/type II bias)
• Analysis
- Identify DMRs (differentially methylated regions)
- Correct for known biases in data
- Correct for cell type (cord blood)
Quality control (QC): probe selection
• Remove probes with many missing values (e.g. >10%)
• Remove observations with bad probes (detection p-value >0.01)
• Remove «gender outliers» and duplicates
• Probes near SNPs are sometimes removed (a SNP <10 bp from the CpG can influence the measured methylation status)
• X/Y chromosomes are also often removed to avoid bias
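The first two filters above can be sketched as follows. This is only an illustration under stated assumptions: a measurement is treated as failed when its detection p-value exceeds 0.01, the same 10% limit is reused for dropping samples (the slide gives it only for probes), and `qc_filter` and the toy data are hypothetical names and values, not part of any package.

```python
def qc_filter(detection_p, p_cutoff=0.01, max_failed_fraction=0.10):
    """Return (kept_samples, kept_probe_indices).

    detection_p maps a sample id to a list of detection p-values,
    one per probe, in the same probe order for every sample.
    """
    n_probes = len(next(iter(detection_p.values())))
    # A measurement fails when its detection p-value exceeds the cutoff.
    failed = {s: [p > p_cutoff for p in pvals]
              for s, pvals in detection_p.items()}

    # Step 1: drop samples in which too many probes failed.
    kept_samples = [s for s, flags in failed.items()
                    if sum(flags) / n_probes <= max_failed_fraction]

    # Step 2: among the kept samples, drop probes that failed too often.
    kept_probes = []
    for j in range(n_probes):
        n_failed = sum(failed[s][j] for s in kept_samples)
        if kept_samples and n_failed / len(kept_samples) <= max_failed_fraction:
            kept_probes.append(j)
    return kept_samples, kept_probes


# Toy data: 3 samples x 10 probes (all values are made up).
good = 0.001
detection_p = {
    "s1": [good] * 10,
    "s2": [good] * 3 + [0.5] + [good] * 6,   # one failed probe (index 3)
    "s3": [0.5] * 5 + [good] * 5,            # half of the probes failed
}

kept_samples, kept_probes = qc_filter(detection_p)
print(kept_samples)
print(kept_probes)
```

Samples are filtered before probes so that one very bad array does not cause otherwise good probes to be discarded; the opposite order is also used in practice.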
MDS plot to evaluate gender outliers
8 outliers
«Hidden» QC issues
• Methylated CpGs close to SNPs
• Non-uniquely mapped probes
• Probe design types (type 1, type 2)
• Bias introduced from different plates / batches
Normalization
• Examine possible problems relating to «plates» (between- and within-array corrections)
• Batch-effects when combining multiple analyses (consortium/meta-analyses)
• Correct for type 1 and type 2 probes
Without QC/normalization, all chromosomes (left); with QC/normalization (right); two different datasets
Betas by Plate
Plate #   N run   N passed QC   % passed
1         96      92            96%
2         96      69            72%
3         96      80            83%
4         96      87            91%
5         96      67            70%
6         96      83            86%
7         96      90            94%
8         96      88            92%
9         96      69            72%
Total     864     725
Image day   Batch   N run   N passed QC   % passed
1/14/2013   1       96      92            96%
1/18/2013   2       60      42            70%
1/19/2013   3       84      72            86%
1/20/2013   4       36      25            69%
1/21/2013   5       36      30            83%
1/22/2013   6       12      10            83%
1/25/2013   7       60      59            98%
1/26/2013   8       132     112           85%
1/27/2013   9       144     126           88%
1/28/2013   10      96      88            92%
2/4/2013    NA      12      0             0%
2/23/2013   11      48      38            79%
2/24/2013   12      48      31            65%
Betas by Batch
Dataset (batch) correction
• Necessary when combining 2 or more datasets
• Colored by dataset (batch), 2 pictures
• PCA of dataset 1 and dataset 2 before «ComBat», all chromosomes
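To make the idea of dataset (batch) correction concrete, here is a deliberately simplified sketch. It is not ComBat: it only shifts each batch so that its mean for a given probe matches the overall mean, with no empirical Bayes shrinkage, no scale adjustment and no covariates. The function name `center_batches` and the values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift each batch so its mean equals the overall mean (one probe).

    values:  measurements for one probe, one per sample
    batches: batch (dataset) label for each sample, same order as values
    """
    overall = mean(values)
    batch_means = {
        b: mean(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    # Remove the batch-specific offset, then restore the overall level.
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]


# Made-up values for one probe measured in two datasets.
values = [0.20, 0.30, 0.60, 0.70]
batches = ["dataset1", "dataset1", "dataset2", "dataset2"]

adjusted = center_batches(values, batches)
print([round(v, 3) for v in adjusted])
```

After the shift both datasets have the same mean for this probe. If the variable of interest differs between datasets, the same shift also removes real signal, which is one way a correction can introduce bias.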
Preprocessing and QC
• QC, «manual» using the dataset
• Gender correction can be performed using PCA (or MDS) and plotting (standard functions in R)
• PCA (or MDS) can also be used to assess the need for within- and between-array normalization, not forgetting batch correction
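As an illustration of how PCA separates groups of samples, the sketch below computes the scores on the first principal component with plain power iteration. It is a toy stand-in for the standard R functions mentioned above, not a replacement for them: `first_pc_scores`, the fixed random seed and the data (two groups of three samples on three hypothetical sex-chromosome probes) are all assumptions made for the example.

```python
import math
import random


def first_pc_scores(data, n_iter=200, seed=0):
    """Scores of each sample on the first principal component.

    data: list of samples, each a list of values (one per probe).
    Uses power iteration on the centered data; the sign of the
    component is arbitrary.
    """
    n, p = len(data), len(data[0])
    col_means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - col_means[j] for j in range(p)] for row in data]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]   # random start direction
    for _ in range(n_iter):
        # Multiply by X'X without forming the covariance matrix.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(p)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:
            break
        v = [w / norm for w in new_v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Made-up beta values: group A (first three) and group B (last three).
samples = [
    [0.10, 0.12, 0.90],
    [0.12, 0.10, 0.88],
    [0.11, 0.11, 0.91],
    [0.50, 0.52, 0.50],
    [0.52, 0.50, 0.48],
    [0.51, 0.49, 0.51],
]

scores = first_pc_scores(samples)
for i, s in enumerate(scores):
    print(i, round(s, 3))
```

The two groups end up on opposite sides of zero on the first component. A sample that lands in the cluster that does not match its recorded sex is a candidate gender outlier.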
Normalization and Preprocessing
• Removal of systematic biases is difficult in EWAS studies. There is also a danger of introducing new ones
• Probe bi-modality can be challenging to work with in a statistical setting
• Normalization procedures tend to depend on the dataset. Sometimes normalization is not needed at all (this only applies to within- and between-array normalization).
• Type-correction normalization must always be performed; BMIQ and RCP seem to be the preferred methods now. Be careful with SWAN…
* Normalization, i.e. correction for technical bias (both within- and between-array), can be performed using the minfi, wateRmelon and methylumi (Bioconductor) packages
* ComBat is a good method for performing between-batch correction. Both parametric and non-parametric empirical Bayes methods are available. Beware of introducing bias, however
* RnBeads is also strongly recommended
Normalization and Preprocessing
Papers that will get you going with pre-processing and QC
Comprehensive analysis of DNA methylation data with RnBeads: Assenov Y, Müller F, Lutsik P, Walter J, Lengauer T, Bock C.
Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi: Fortin JP, Triche TJ Jr, Hansen KD.
Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays: Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA.
A data-driven approach to preprocessing Illumina 450K methylation array data: Pidsley R, Y Wong CC, Volta M, Lunnon K, Mill J, Schalkwyk LC. (wateRmelon package)
A framework for analyzing DNA methylation data from Illumina Infinium HumanMethylation450 BeadChip: Wang Z, Wu X, Wang Y.
A systematic assessment of normalization approaches for the Infinium 450K methylation platform: Wu MC, Joubert BR, Kuan PF, Håberg SE, Nystad W, Peddada SD, London SJ.
quantro: a data-driven approach to guide the choice of an appropriate normalization method: Hicks SC, Irizarry RA.
Outline
• Preprocessing (QC, normalisation)
• Transformation (beta-value versus M-value)
• Analysis:
• 1-1 regression (GLM, etc.)
• Shrinkage methods (LASSO/RIDGE+variants)
• Dantzig selector
• More detailed «time series» analyses of each individual(?)
• Region-based analysis of candidate genes and identified regions (DMR) – bumphunting
• Pathways/GO
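Two items in the outline, the beta-value versus M-value transformation and the 1-1 regression, can be sketched together. The M-value is the log2 ratio M = log2(beta / (1 - beta)). The regression here is read as one ordinary least-squares fit per CpG site, which is an interpretation of "1-1 regression" and much simpler than a full GLM with covariates. The function names, probe names and data are hypothetical.

```python
import math


def m_value(beta, eps=1e-6):
    """M = log2(beta / (1 - beta)), with beta clipped away from 0 and 1."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def ols_fit(x, y):
    """Slope and intercept of the least-squares line y = intercept + slope*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation; the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: exposure (0/1) for six samples, betas for two CpG sites.
exposure = [0, 0, 0, 1, 1, 1]
betas = {
    "cpg_x": [0.5, 0.5, 0.5, 0.8, 0.8, 0.8],   # differs by exposure
    "cpg_y": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2],   # no difference
}

# One regression per CpG site, on the M-value scale.
slopes = {}
for cpg, values in betas.items():
    m_values = [m_value(b) for b in values]
    slopes[cpg], _ = ols_fit(exposure, m_values)
    print(cpg, round(slopes[cpg], 3))
```

A real analysis would add covariates, standard errors and a correction for testing several hundred thousand sites; this sketch only shows the shape of the per-site loop.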