Multiple trait genome-wide association studies: Applications and methods


Marissa Erin LeBlanc

Dissertation presented for the degree of Philosophiae Doctor (PhD)

Department of Clinical Molecular Biology, Institute of Clinical Medicine

and

Oslo Centre of Biostatistics and Epidemiology

University of Oslo

Oslo, February 2016


© Marissa Erin LeBlanc, 2016

Series of dissertations submitted to the Faculty of Medicine, University of Oslo

ISBN 978-82-8333-241-4

Cover: Hanne Baadsgaard Utigard

Printed in Norway: 07 Media AS – www.07.no


Acknowledgements

A degree in (bio)statistics has been a dream of mine since completing my MSc in Genetics in 2003. I would like to express my deepest gratitude to my statistical supervisor Dr. Bettina Kulle Andreassen and co-supervisor Professor Arnoldo Frigessi for taking a chance and hiring me, a non-traditional candidate for a PhD in Biostatistics. I would like to thank my clinical co-supervisor Dr. Ole A. Andreassen for providing the opportunity to collaborate on projects involving interesting applications of statistical genomics. To all three of my supervisors, thank you for your support, patience and for interesting discussions. I look forward to continuing to collaborate with all of you in the future.

I would like to thank Norway and the University of Oslo for providing an environment where one can simultaneously earn a doctorate degree, have a young family and earn a reasonable salary while doing so.

I would like to thank my co-authors and my colleagues at the Oslo Centre of Biostatistics and Epidemiology, both at the University of Oslo and at Oslo University Hospital, for contributing to a positive, stimulating and motivating work environment.

Christian Page and Dr. Verena Zuber deserve particular thanks. I hope to continue working directly and indirectly with all of you for many years to come.

To my family, thank you for your patience. This has been a long time coming.

Rasmus, Alma, Leona and Esben, this is for all of us.


List of papers

Paper 1

LeBlanc, M., Kulle, B., Sundet, K., Agartz, I., Melle, I., Djurovic, S., Frigessi, A. and Andreassen, O.A., 2012. Genome-wide study identifies PTPRO and WDR72 and FOXQ1-SUMO1P1 interaction associated with neurocognitive function. Journal of Psychiatric Research, 46(2), pp.271-278.

Paper 2

LeBlanc, M., Zuber, V., Andreassen, B.K., Witoelar, A., Zeng, L., Bettella, F., Wang, Y., McEvoy, L.K., Thompson, W.K., Schork, A.J., Reppe, S., Barrett-Connor, E., Ligthart, S., Dehghan, A., Gautvik, K.M., Nelson, C.P., Schunkert, H., Samani, N.J., CARDIoGRAM Consortium, Ridker, P.M., Chasman, D.I., Aukrust, P., Djurovic, S., Frigessi, A., Desikan, R.S., Dale, A.M. and Andreassen, O.A., 2016. Identifying Novel Gene Variants in Coronary Artery Disease and Shared Genes with Several Cardiovascular Risk Factors. Circulation Research, 118(1), pp.83-94.

Paper 3

LeBlanc, M.*, Zuber, V.*, Thompson, W.K., Andreassen, O.A., Frigessi, A. and Andreassen, B.K., 2016. A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework. (Submitted to PLoS Genetics)

*contributed equally


Table of contents

1 Introduction
1.1 A primer to human genetics
1.1.1 Organization of the genetic material
1.1.2 Transmission of genetic material from parent to offspring
1.2 A brief historical description of GWAS
1.2.1 Advent of GWAS
1.2.2 The HapMap Project
1.2.3 What is a GWAS?
1.2.4 The GWAS era
2 Methods
2.1 GWAS – Analytical pipeline
2.1.1 Review
2.1.2 Association testing in GWAS
2.1.3 Corrections for multiple testing in GWAS
2.1.4 Validation of GWAS “discoveries”
2.1.5 Meta-analysis in GWAS
2.1.6 Multiple related phenotypes in GWAS
2.1.7 Relevant phenotypes for this thesis
2.1.7.1 Neurocognitive function
2.1.7.2 Coronary artery disease
2.2 Methodology in the post-GWAS era
2.2.1 Are further discoveries possible with existing GWAS data?
2.2.2 Example 1: Expression quantitative trait loci (eQTL) and GWAS
2.2.3 Example 2: Genome annotation and GWAS
2.2.4 Example 3: Multiple traits, pleiotropy and GWAS
2.3 Sample overlap in cross-trait analysis of GWAS
2.4 A primer to false discovery rate methodology
2.4.1 Benjamini-Hochberg false discovery rate
2.4.2 The Bayesian approach to the false discovery rate
2.4.3 Bivariate extensions of the false discovery rate
2.4.3.1 Conditional false discovery rate
2.4.3.2 Covariate modulated local false discovery rate
3 Aims
4 Summary of papers in this thesis
4.1 Paper 1
4.2 Paper 2
4.3 Paper 3
5 Discussion
5.1 Thesis overview
5.2 Paper 1
5.2.1 Paper 1 – main contributions
5.2.2 Paper 1 – strengths and weaknesses
5.2.3 Paper 1 – future work
5.2.4 Paper 1 – conclusion
5.3 Paper 2
5.3.1 Paper 2 – main contributions
5.4 Paper 3
5.4.1 Paper 3 – main contribution
5.5 Concluding Remarks
References
Figures
Followed by Papers 1-3 with Supplements


1 Introduction

This thesis falls into the realm of statistical genomics. Statistical genomics is special in that it involves the integration of theory from two fields: statistics and genetics, in particular quantitative and population genetics. Much like statistics, genetics has an extensive theoretical foundation in the form of mathematical models that show how different evolutionary pressures, namely selection, mutation, migration and random genetic drift, affect gene frequencies and genetic variation. Statistical genomics builds on this knowledge and provides the statistical tools needed to make inference from genomic data, which is universally complex, high dimensional and fraught with multiple testing issues.

The central focus of this thesis is the genome-wide association study (GWAS) and associated applications and statistical methods. In brief, the goal of GWAS is to identify polymorphic loci, specific positions of variation in human DNA, that are associated with a given disease or trait. As will become apparent, this is not a straightforward task, and is filled with both genetic and statistical issues. These include, but are not limited to, dealing with the non-independence of alleles along a chromosome (linkage disequilibrium; LD), the frequency distribution of risk alleles, the statistical modeling of the relationship between disease and genotype, statistical correction for testing up to millions of genetic variants often in a hypothesis-free context, and particularly in this thesis, dealing with multiple correlated traits.

This thesis is divided into five chapters. The first chapter provides an introduction, including a primer to human genetics and a historical description of GWAS together with the advent of the genomic era. The second chapter describes the materials and methods used in this thesis, including the GWAS analytical pipeline and new methodology in the so-called post-GWAS era. The second chapter also gives details of the false discovery rate methodology and the phenotypes used in this thesis. The third chapter states the specific aims of this thesis and the fourth chapter gives a brief summary of each PhD paper. Finally, in the fifth chapter, the papers are discussed, and concluding remarks are made.

1.1 A primer to human genetics

1.1.1 Organization of the genetic material

The cells of every life form contain a special molecule called deoxyribonucleic acid (DNA), the so-called "blueprint of life", that influences how each organism develops, functions and passes traits to the next generation. Humans are diploid and have approximately 5x10¹³ cells, each of which has a nucleus where the DNA is organized on 23 pairs of chromosomes. Within an individual, each cell contains the identical DNA sequence of nucleotides, each carrying one of four bases: adenine (A), guanine (G), cytosine (C) and thymine (T). The Human Genome Project, completed in 2003, decoded the human genome and provided the first map of these bases along the 23 chromosomes. It is possible to provide such a "reference genome" because about 97% of the genome is fixed (Auton et al., 2015). The remainder of the genome shows variation between individuals, and potentially contributes to the similarity between relatives. Similarity between relatives implies heritable variation; heritability is the fraction of phenotypic variability between individuals that can be attributed to genetic variation. Notably, heritability can be estimated from phenotype data on relatives alone and does not require any genotyping. Complex traits that have high heritability are exactly those that are targeted by GWAS.

1.1.2 Transmission of genetic material from parent to offspring

The genetic material is passed on from parent to offspring via a process called meiosis that occurs exclusively in the sex cells and leads to the formation of haploid gametes, containing only one set of chromosomes; these are the sperm in males and eggs in females. Crucial to this process is recombination, where novel non-parental combinations of genetic variants are formed along the chromosomes. This is an important contributor to variation in human populations, and is a critical factor to consider when applying statistical methods to genomic data.

If recombination did not exist, all loci along a chromosome would be linked, causing strong statistical dependencies for all loci on a given chromosome. But since recombination does occur, loci that are far apart on a chromosome are inherited independently from each other. LD, i.e. the non-random association between genetic variants (alleles) on a chromosome, tends to be strong for physically close loci and tends to get weaker and weaker as a function of distance. LD decays over time and the pattern and extent of LD seen in human populations has been shaped by population history and events such as bottlenecks (sudden reductions in population size) and periods of rapid growth (Reich et al., 2001). When only genotyping a subset of variants, exploiting patterns of LD is an essential element for a good genome-wide genotyping strategy.
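As a rough illustration of how LD between two loci can be quantified, the sketch below computes the squared Pearson correlation (r²) between two additively coded SNPs. This is a genotype-based approximation rather than a haplotype-based estimate; the function name `ld_r2` and the toy genotypes are hypothetical and chosen only for illustration.

```python
import numpy as np


def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two SNPs coded 0/1/2.

    Genotype-based approximation to r^2; phased haplotypes are not used.
    """
    a = np.asarray(geno_a, dtype=float)
    b = np.asarray(geno_b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2


# Toy genotypes for six individuals (hypothetical data).
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [0, 1, 2, 1, 0, 2]  # identical to snp1, i.e. perfect LD
snp3 = [2, 0, 1, 1, 2, 0]  # only partly correlated with snp1

r2_perfect = ld_r2(snp1, snp2)
r2_partial = ld_r2(snp1, snp3)
```

A tag SNP strategy relies on pairs like `snp1` and `snp2`: when r² is close to 1, genotyping one of the two SNPs captures nearly all the information carried by the other.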

1.2 A brief historical description of GWAS

1.2.1 Advent of GWAS

Long before the Human Genome Project was started, geneticists understood that genetic variation was key to understanding heritable complex traits and disease.

Developing a strategy to identify specific trait loci, however, was and is a complex issue. Critical to this is the allelic spectrum of disease, i.e. the frequency distribution of risk alleles. This topic was heavily debated around the turn of the millennium in anticipation of the genomic era (Pritchard and Cox, 2002, Reich and Lander, 2001, Weiss and Clark, 2002). How many loci contribute to a common trait? Should the risk alleles be common or rare? Several lines of reasoning led to the conclusion that, with a few known exceptions, complex traits are highly polygenic, that is, they have hundreds or thousands of contributing risk loci (Weiss and Clark, 2002, Pritchard and Cox, 2002, Reich and Lander, 2001). The polygenic nature of complex traits is widely accepted as fact. More controversial, particularly in the pre-genomic era, was the allelic spectrum of disease. The debate centered on whether risk variants (alleles) for common disease would be common or rare. Proponents of the so-called common disease/common variant hypothesis (CD/CV) argued that late-onset common diseases should be largely caused by common variants of modest effect size. Common variants are appealing to work with because they are relatively easy and cheap to identify, can be detected in smaller samples (power) and are generally older and geographically dispersed. However, neighbouring older variants have undergone more recombination, which breaks down LD compared to rarer, local or more recent variants. The disadvantage here is that with weak LD, more variants need to be typed in order to have reasonable coverage of the common variation in the human genome. The strongest evidence for CD/CV came from the simulation studies of Reich and Lander (2001), who showed that CD/CV was plausible. However, Weiss and Clark (2002) provided very strong theoretical evidence showing that most risk variants would likely not be common. Despite the lack of solid evidence for the CD/CV hypothesis, and despite compelling evidence against this model of common disease, there was a strong push to go forward with a strategy to map the common variation in the human genome. This strategy was likely pursued because, at that time, the technology did not exist for large-scale sequencing studies (essential to map rare variants), and the genetics community largely believed that common variants were the best hope for genetic mapping of disease.

1.2.2 The HapMap Project

The HapMap project (http://hapmap.ncbi.nlm.nih.gov) was initiated late in 2002 in order to map all of the common single nucleotide variants in the human genome, known as single nucleotide polymorphisms (SNPs), and its first phase was completed in 2005 (Gibbs et al., 2003, International HapMap Consortium, 2005). The HapMap project focused solely on common SNPs, in this case defined as those where the least frequent allele (called the minor allele) occurs in at least 1% of the population. Using the publicly available HapMap data, and its corresponding LD structure, so-called tag SNPs can be identified. Tag SNPs are SNPs that are very well correlated with all the other SNPs in a defined region. Such a strategy was employed when designing the first commercially available GWAS genotyping panels, such as those offered by Affymetrix and Illumina.

1.2.3 What is a GWAS?

A GWAS is a type of study that is typically conducted in hundreds or thousands of unrelated subjects. Unrelated subjects can be treated as independent observations, whereas with related subjects, the genetic correlations due to relatedness need to be taken into account. The samples typically come from retrospective cohort studies or case-control studies. For each subject, a trait or several traits (outcomes, also called phenotypes in genetics) and covariates of interest are measured, and hundreds of thousands to a few million common genetic variants are genotyped. For case-control studies, independently for each of these SNPs, it is then investigated whether the allelic or genotype frequencies differ significantly between the case and the control groups. This is typically done using logistic regression and the effect size is reported as an odds ratio. Similarly for quantitative traits, linear regression is used independently for each SNP to see if a given SNP is associated with the outcome. The effect size in this case is reported as the regression coefficient for the SNP term. Clearly with so many statistical tests being performed, a formal statistical correction for multiple testing is a mandatory step in GWAS, and this subject will be explored in detail in Sections 2.1.3 and 2.4.

Although there are several options for modeling SNP genotype, the most common approach is to use an additive model, where SNP genotypes are ordered and then modeled on a continuous scale. For instance a SNP with minor allele “A” and major allele “G” would have 3 possible genotypes: “AA”, “AG” and “GG”. With the additive model, we assume that genotype contributes in an additive manner to the phenotype, which implies that the heterozygous genotype “AG” lies exactly in between the two homozygous genotypes. As such we can translate the genotype categories “AA”, “AG” and “GG” to a continuous scale: 0, 1, 2. Nearly all of the published GWAS studies to date use this approach.
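The additive coding described above can be sketched in a few lines. The helper name `additive_code` is hypothetical; following the example in the text, the major allele "G" is the allele being counted, so that "AA", "AG" and "GG" map to 0, 1 and 2.

```python
def additive_code(genotype, counted_allele):
    """Count copies of `counted_allele` in a two-character genotype string."""
    if len(genotype) != 2:
        raise ValueError("expected a two-allele genotype such as 'AG'")
    return sum(1 for allele in genotype if allele == counted_allele)


# Genotypes from the example in the text: minor allele "A", major allele "G".
genotypes = ["AA", "AG", "GG"]
codes = [additive_code(g, "G") for g in genotypes]
```

Counting the other allele instead simply reverses the scale and flips the sign of the estimated effect, which is why the counted allele has to be stated whenever effect sizes are reported or shared.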

1.2.4 The GWAS era

After the completion of the first HapMap phase in 2005, it was theoretically possible to create a genome-wide panel of tag SNPs. Simultaneously, advances in genotyping technology meant that it was suddenly efficient, both in terms of cost and time, to carry out genome-wide genotyping, thanks to commercially available GWAS chips from Illumina and Affymetrix. The first large-scale, well-designed GWAS for complex disease was published in Nature in 2007 and performed by the Wellcome Trust Case Control Consortium (WTCCC; Messerli et al., 2007). This study seeded a massive publication boom, in which the number of GWAS studies has grown at an increasing rate. This is clearly evident in Figure 1, from the Catalog of Published Genome-Wide Association Studies, showing the number of GWAS publications per calendar year from 2005-2013 (Welter et al., 2014). As of 2013, the catalog contained 1751 curated publications of 11 912 SNP associations at p < 10^-5.

Clearly the GWAS approach for discovery of disease-associated genetic variants has been widely adopted and a large number of trait-associated SNPs have been found. But has GWAS really been a success? It turns out that this is not a simple question to answer. Very few common variants of moderate to major effect have been found via GWAS. GWAS trait-associated common variants have very small effect sizes (Figure 2; Manolio et al., 2009). In fact, the effect sizes of common variants on common disease are universally so small that tens of thousands or even over one hundred thousand subjects are required to conduct a reasonably powered GWAS.

Such sample sizes are impossible to achieve in individual studies and necessitate the formation of international consortia. These consortia perform meta-analysis of essentially all globally available samples for a given trait, with genotyping data available and that meet the inclusion criteria (e.g. often limited to one ethnicity).

Examples include the Psychiatric Genetics Consortium (https://www.med.unc.edu/pgc) and the CARDIoGRAMplusC4D Consortium (www.cardiogramplusc4d.org/), whose data are used in this thesis. Even when including virtually every sample available globally, GWAS are still notoriously underpowered to identify common variants with extremely small effect sizes (odds ratio < 1.1), independent of the choice of genotyping platform (Spencer et al., 2009).

In 2009, Manolio et al. coined the term “missing heritability”, referring to the fact that the genetic variants identified by GWAS explain very little of the heritability for most complex traits and common diseases. The so-called missing heritability is likely due to a wide variety of factors including, but not limited to, epigenetics, disease-causing rare variants, gene-gene interactions, gene-environment interactions and the underpowered nature of typical GWAS analysis (Eichler et al., 2010). One analytical approach to uncovering part of the missing heritability is to apply more sophisticated statistical methods to existing GWAS data, especially if the methods involve the addition of biological knowledge. But before exploring these more advanced methods, it is important to establish an understanding of the conventional GWAS statistical analysis pipeline.

2 Methods

2.1 GWAS – Analytical pipeline

2.1.1 Review

Let us first recall that the goal of GWAS is to detect loci associated with variation in a trait of interest, usually in a sample of independent subjects. Let us also recall that, because of the statistical dependencies between loci (i.e. LD), a properly chosen panel of ~1,000,000 SNPs is sufficient to tag most of the common genetic variation in the (European) human genome. As was introduced in Section 1.2.3, assuming the additive genetic model allows us to treat the three genotype categories at a bi-allelic SNP as a continuous variable with coding 0 (minor allele homozygote), 1 (heterozygote) and 2 (major allele homozygote).

We will first consider a simple GWAS design, with one retrospective sample, one phenotype and a one-SNP-at-a-time regression analysis. In GWAS, the phenotype is either categorical (usually binary case/control) or continuous (called “quantitative” in genomics literature). An underlying assumption for a successful GWAS is that the chosen genotyping platform has reasonable genomic coverage for the population from which the samples have been drawn.

2.1.2 Association testing in GWAS

Quantitative phenotypes are analyzed using a linear regression approach, with one SNP and clinical covariates as predictor variables. For simplicity, let us consider a regression model without covariates, but in practice covariates can easily be added.

For n samples with outcome $y_j$, $j = 1, \ldots, n$, and additively-modelled genotype $x_{jg}$ in individual j for SNP g, the linear regression model is $y_j = \alpha_g + \beta_g x_{jg} + \varepsilon_{jg}$, where $\varepsilon_{jg}$ is normally distributed with mean 0 and describes the error term of the relationship between outcome and genotype. The null hypothesis for each SNP g is that the coefficient of the SNP term, $\beta_g$, is equal to zero, and the test statistic is a Wald test, $\hat{\beta}_g / \mathrm{se}(\hat{\beta}_g)$, where se is the standard error. Here the estimated coefficient of the SNP term is reported as the effect size. Binary phenotypes are usually analyzed using a logistic regression approach, again with one SNP and clinical covariates as predictor variables. Covariates are again dropped for simplicity. Here, we code the outcome $Y_j$ as 1 for cases and 0 for controls, and the logistic regression model is $\log \frac{\Pr(Y_j = 1)}{\Pr(Y_j = 0)} = \alpha_g + \beta_g x_{jg}$. The null hypothesis for each SNP g is that the probability of being a case or a control is not associated with genotype. The effect size of a given SNP is reported as an odds ratio.
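The single-SNP linear regression and its Wald test can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the papers: there are no covariates, the p-value uses a large-sample normal approximation to the Wald statistic, and the function name `snp_wald_test` and the simulated data are hypothetical.

```python
import math

import numpy as np


def snp_wald_test(y, x):
    """Fit y = intercept + beta * x by least squares for one SNP.

    Returns (beta, se, z, p) where z = beta / se is the Wald statistic and
    p is a two-sided p-value from a normal approximation (an assumption
    that is reasonable for the large samples typical of GWAS).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    x_centered = x - x.mean()
    sxx = np.sum(x_centered ** 2)
    beta = np.sum(x_centered * (y - y.mean())) / sxx
    intercept = y.mean() - beta * x.mean()
    residuals = y - intercept - beta * x
    sigma2 = np.sum(residuals ** 2) / (n - 2)  # residual variance estimate
    se = math.sqrt(sigma2 / sxx)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, z, p


# Simulated example: 2000 subjects, minor allele frequency 0.3, true effect 0.2.
rng = np.random.default_rng(0)
x = rng.binomial(2, 0.3, size=2000)
y = 0.2 * x + rng.normal(size=2000)
beta, se, z, p = snp_wald_test(y, x)
```

In a GWAS this function would be applied independently to each SNP on the panel, giving one effect size, standard error and p-value per SNP.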

In the regression analysis step, it is possible to correct for genetic differences within the sample (population stratification) by including the first few principal components derived from the genome-wide SNP panel, which can be interpreted as a sort of “origin score”. This correction is desirable because it protects against spurious associations when both the phenotypic distributions and the allelic frequency distributions differ between subpopulations that may exist in the dataset.
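As a hedged sketch of where such an “origin score” comes from, the code below standardizes a genotype matrix and extracts its leading principal components with a singular value decomposition. The function name `top_principal_components`, the two simulated subpopulations and all parameter values are assumptions made for illustration; dedicated software with LD pruning and outlier handling is normally used in practice.

```python
import numpy as np


def top_principal_components(genotypes, k=2):
    """Return the first k principal component scores (individuals x k)."""
    g = np.asarray(genotypes, dtype=float)
    g = g[:, g.std(axis=0) > 0]  # drop monomorphic SNPs
    z = (g - g.mean(axis=0)) / g.std(axis=0)  # standardize each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :k] * s[:k]


# Simulate two subpopulations whose allele frequencies differ slightly.
rng = np.random.default_rng(1)
n_per_pop, n_snps = 100, 1000
freq_a = rng.uniform(0.1, 0.9, size=n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, size=n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
])
labels = np.repeat([0, 1], n_per_pop)

pcs = top_principal_components(geno, k=2)
pc1_label_corr = abs(np.corrcoef(pcs[:, 0], labels)[0, 1])
```

In this simulation the first component largely recovers subpopulation membership, which is why adding the leading components as covariates in the per-SNP regression absorbs stratification effects.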

After one regression model is built for each SNP on the genotyping panel, the strength of association between each SNP and the phenotype can be summarized by an effect size, associated confidence interval and a p-value. Given that on the order of one million SNPs are included in a GWAS, correction for multiple testing is a critical step in any GWAS analysis. There are several options here, as well as clear conventions established in the GWAS literature.


2.1.3 Corrections for multiple testing in GWAS

A p-value, which is the probability of seeing a test statistic equal to or more extreme than the observed test statistic if the null hypothesis is true, is generated for each statistical test. Statistical tests are generally called significant (i.e. the null hypothesis is rejected) if the p-value falls below a predefined threshold $\alpha$, most often set to 0.05 and known as the type I error rate. This probability applies to a single statistical test, but in GWAS on the order of 10⁶ tests are conducted. If we were to declare SNPs as significantly associated with phenotype based on their p-values and a cut-off of 0.05, the cumulative type I error rate over all statistical tests would be much greater than 0.05. As such, formal corrections for multiple testing are necessary in order to maintain an overall type I error rate of 0.05 for the entire GWAS.

The simplest approach to correcting for multiple testing is the Bonferroni correction. The Bonferroni correction adjusts the significance threshold from $\alpha = 0.05$ to $\alpha = 0.05/m$, where m is the number of statistical tests conducted, i.e. the number of SNPs in the GWAS.

A related approach is to use the Bonferroni correction for genome-wide significance. The Bonferroni corrected significance threshold for a million tests is 0.05/1,000,000 = 5x10^-8, and this cut-off is very commonly used as the “gold standard” for declaring an association significant in GWAS, regardless of the number of SNPs on the genotyping panel. This is because, for the European population, it is estimated that there are approximately one million independent common SNPs in the genome, once the dependencies due to LD are taken into account (Clarke et al., 2011). Another estimate is 7.2x10^-8, but p < 5x10^-8 is the most common choice in the literature (Dudbridge and Gusnanto, 2008, Pe'er et al., 2008).
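The arithmetic behind the genome-wide threshold can be checked directly. The sketch below assumes independent tests in which every null hypothesis is true, which is a simplification; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold that keeps the overall rate at alpha."""
    return alpha / m


def familywise_error_rate(per_test_alpha, m):
    """Probability of at least one false positive across m independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - per_test_alpha) ** m


m = 1_000_000  # approximate number of independent common SNPs
threshold = bonferroni_threshold(0.05, m)

# Without correction, at least one false positive is essentially certain.
fwer_uncorrected = familywise_error_rate(0.05, m)

# With the Bonferroni threshold, the overall rate stays just below 0.05.
fwer_corrected = familywise_error_rate(threshold, m)
```

The corrected family-wise rate comes out slightly under 0.05 rather than exactly 0.05, which reflects the mildly conservative nature of the Bonferroni bound even for independent tests.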

The Bonferroni correction is appropriate when a single false positive in a set of tests would be a problem; otherwise it is a very conservative approach and potentially leads to a large number of false negatives. An alternative, less conservative approach to correcting for multiple testing involves controlling the expected proportion of false discoveries amongst the rejected null hypotheses instead. In GWAS, this is the proportion of trait-associated SNPs that are actually false positives. The first statistical procedure for controlling the false discovery rate (FDR) was proposed by Benjamini and Hochberg (1995). In brief, the p-values are ordered from smallest to largest, and assigned a corresponding rank i. For instance, for the smallest p-value, i = 1. Each individual p-value is then compared to its Benjamini-Hochberg critical value, (i/m)*q, where i is the rank, m is the total number of tests, and q is the chosen false discovery rate. The largest p-value that is less than (i/m)*q is significant, and all of the p-values smaller than it are also significant.
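The Benjamini-Hochberg step-up procedure described above can be sketched as follows. The function name and the example p-values are hypothetical, and the comparison uses "less than or equal to" the critical value, as in the original Benjamini and Hochberg formulation.

```python
import numpy as np


def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean array marking p-values declared significant at FDR q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)  # indices that sort the p-values
    critical = (np.arange(1, m + 1) / m) * q  # (i/m)*q for ranks i = 1..m
    below = p[order] <= critical
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its cut-off
        significant[order[:k + 1]] = True  # it and all smaller p-values
    return significant


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_calls = benjamini_hochberg(pvals, q=0.05)

# Step-up behaviour: 0.02 exceeds its own critical value (0.0125) but is
# still declared significant because a larger p-value meets its cut-off.
step_up_calls = benjamini_hochberg([0.02, 0.021, 0.022, 0.9], q=0.05)
```

The second example shows why the procedure is called "step-up": significance is decided by the largest rank that meets its critical value, not by checking each p-value against its own cut-off in isolation.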

Importantly, the FDR and Bonferroni corrections do not re-order the SNPs compared to their raw p-value rankings; they simply suggest different cut-off points as to what is declared as statistically significant. In Section 2.4 we will further explore procedures for correcting for multiple testing, including some methods that re-order the SNPs compared to their raw p-value rankings. But for the time being, let us continue to describe the typical GWAS pipeline.

2.1.4 Validation of GWAS “discoveries”

The gold standard for validation of a GWAS association is the replication of the association in an independent sample. Here the burden of multiple testing is less severe and the correction only needs to be made for the number of SNPs in the “associated” SNP set carried forward to the validation step, often on the order of 50 – 100 SNPs.

2.1.5 Meta-analysis in GWAS

The description of the GWAS analytical pipeline above assumes that the individuals are from one sample. In practice, nearly all GWAS studies of major impact are conducted by consortia who collect as many studies as possible to be combined in a meta-analysis. Here issues such as phenotype definition, inclusion criteria, population stratification and genotyping platform become critically important. Imputation of missing genotype data is usually required. Genotype imputation exploits known LD patterns and haplotype frequencies in a reference population (e.g. from HapMap or the 1000 Genomes project) to estimate genotypes for SNPs not directly genotyped in the study. Meticulous routines for data storage, security, privacy and access are required. Detailed discussion of these issues is beyond the scope of this thesis but it is important to keep these potentially complicating factors in mind.

Assuming all of the issues above have been dealt with in a reasonable manner, conducting a GWAS meta-analysis for a given phenotype is straightforward. Each contributing study provides the regression-derived effect size, the associated standard error and the sample size for each SNP. Importantly, each study must specify the reference allele at each SNP; otherwise the effect direction cannot be aligned between studies. Subsequently, a meta-analysis, such as an inverse variance meta-analysis, is conducted. The meta-analysis effect size estimate and associated p-value are then reported for each SNP and correction for multiple testing is performed. Usually the GWAS consortium will exclude some of its contributing studies from the meta-analysis and reserve them for a second phase of analysis (i.e. validation of the associated SNPs).
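For a single SNP, a fixed-effect inverse variance meta-analysis can be sketched as below. This assumes that the effect sizes from all studies have already been aligned to the same reference allele and that the studies are independent; the function name and the two example studies are hypothetical, and the p-value uses a normal approximation.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse variance meta-analysis for one SNP.

    betas: per-study effect sizes, aligned to the same reference allele.
    ses:   per-study standard errors.
    Returns (beta, se, z, p) for the combined estimate.
    """
    weights = [1.0 / se ** 2 for se in ses]  # precision of each study
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta, se, z, p


# Two hypothetical studies with equal precision.
meta_beta, meta_se, meta_z, meta_p = inverse_variance_meta(
    betas=[0.10, 0.20], ses=[0.05, 0.05]
)
```

With equal standard errors the combined estimate is the simple average of the two effect sizes, and the combined standard error shrinks by a factor of the square root of two, which illustrates the gain in power from pooling studies.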

2.1.6 Multiple related phenotypes in GWAS

It is common that several related phenotypes are investigated by GWAS. This can be carried out in one study using the same sample set (e.g. the Global Lipids Consortium used the same sample to investigate several outcomes including triglycerides, high-density lipoprotein, low-density lipoprotein and total cholesterol (Teslovich et al., 2010)). Related phenotypes can also be investigated in (approximately) independent samples by separate consortia and published in separate publications (e.g. blood pressure (Ehret et al., 2011) and triglycerides (Teslovich et al., 2010)). When related phenotypes are investigated in the same sample, it is often because there is not one obvious primary phenotype, and it is cost-effective to look at as many heritable phenotypes as possible using the same dataset. Other motivations for investigating related phenotypes (in one sample or in independent samples) include that the genetic basis of so-called endophenotypes (stable phenotypes with a clear genetic connection) should be easier to identify than that of broader clinical definitions of disease or other quantitative traits (such as body mass index). Some related complex phenotypes, like type 2 diabetes and coronary artery disease, clearly merit their own consortium-level investigation.

Until recently, it was not common to integrate cross-phenotype results in any formal way. However, informal investigations of overlapping “discoveries”, usually at the gene level, were often made. An example of this is the Venn diagram summarizing the findings of the Global Lipids Consortium, which gives a visual display of the overlapping gene sets for the four investigated lipids phenotypes (Figure 3). It is perhaps expected that related lipids phenotypes will also have overlapping gene sets, given their strong phenotypic correlation.

Statistics has well-developed methodology for dealing with multivariate data but these methods are rarely applied to GWAS data in order to deal with multiple, related phenotypes. The reasons for this are not entirely clear, but likely just have to do with conventions in the field of genomics. For a summary of multivariate methods for GWAS see Galesloot et al. (2014).


2.1.7 Relevant phenotypes for this thesis

2.1.7.1 Neurocognitive function

Neurocognitive function broadly refers to multiple inter-correlated cognitive domains including attention, psychomotor speed, learning and memory, intelligence and executive functioning. In Paper 1, we investigate twenty-four neurocognitive tests falling into these five clinical domains via GWAS.

Heritability estimates for different aspects of neurocognitive function range from approximately 50 to 80% (e.g. Lee et al., 2010). Despite its high heritability, neurocognitive function is a particularly challenging phenotype to investigate via GWAS. Reasons for this include the multivariate nature of neurocognition, the lack of a clear primary phenotype and a lack of consistent phenotype definitions across studies (due to different test batteries for neurocognition). Additionally, there is no consensus on how to deal with important covariates and confounders such as age, education and underlying diseases. Options here include using these as inclusion/exclusion criteria or including them as covariates in the statistical model. All in all, these particular challenges encountered for GWAS of neurocognitive function result in highly underpowered studies with limited or no options for replication.

Presently, nine loci have been associated with the key words “general cognitive ability” or “intelligence” or “cognitive test” or “neurocognitive function” in the GWAS catalogue (http://www.ebi.ac.uk/gwas/home) at a p-value implying genome-wide significance (p-value < 5x10^-8). The results are summarized in Table 1. We include the results from Paper 1 in this list since it was already published in 2012.

Table 1. Single nucleotide polymorphisms associated with neurocognition at p-value < 5x10^-8. Chr, chromosome.

rs# Gene Chr Reference

rs10457441 intergenic 6 (Davies et al., 2015)

rs17522122 AKAP6 14 (Davies et al., 2015)

rs10119 TOMM40 19 (Davies et al., 2015)

rs2300290 PTPRO 12 (LeBlanc et al., 2012, i.e. Paper 1)

rs719714 WDR72 15 (LeBlanc et al., 2012, i.e. Paper 1)

rs6043979 KIF16B 20 (Loo et al., 2012)

rs3758171 PAX5 9 (Loo et al., 2012)

rs3815908 ELSPBP1 19 (Loo et al., 2012)

rs17518584 CADM2 3 (Ibrahim-Verbaas et al., 2015)

2.1.7.2 Coronary artery disease

In Paper 2, coronary artery disease (CAD) is investigated via GWAS. CAD is a leading cause of death worldwide. CAD occurs when cholesterol and plaque build up in the arteries that supply blood to the heart, causing them to harden and narrow. Less blood is then able to flow through the arteries and less oxygen reaches the heart, which can lead to heart attack and often to permanent heart damage or even death. CAD also leads to heart failure and irregular beating of the heart. The heritability of CAD is approximately 40-50% (Peden and Farrall, 2011).

Several related consortia have investigated the genetics of CAD via GWAS leading to the identification of 46 CAD-associated loci achieving both p-value<5x10^-8 and validation in an independent dataset (CARDIoGRAMplusC4D Consortium et al., 2013; Table 2). These 46 loci were for the most part identified via consortium-based efforts including that of the CARDIoGRAMplusC4D Consortium whose summary statistic data is used in Paper 2.

Table 2. Single nucleotide polymorphisms associated with coronary artery disease at p-value < 5x10^-8. Loci are reported in at least one of the following publications: (CARDIoGRAMplusC4D Consortium et al., 2013, Schunkert et al., 2011, Samani et al., 2007, Clarke et al., 2009, Kathiresan et al., 2009, Soranzo et al., 2009, Wang et al., 2011, IBC 50K CAD Consortium, 2011).

rs# Chr Gene

rs4845625 1 IL6R

rs515135 2 APOB

rs2252641 2 ZEB2-AC074093.1

rs1561198 2 VAMP5-VAMP8-GGCX

rs7692387 4 GUCY1A3

rs273909 5 SLC22A4-SLC22A5

rs10947789 6 KCNK5

rs4252120 6 PLG

rs264 8 LPL

rs9319428 13 FLT1

rs17514846 15 FURIN-FES

rs2954029 8 TRIB1

rs6544713 2 ABCG5-ABCG8

rs1878406 4 EDNRA

rs2023938 7 HDAC9

rs602633 1 SORT1b

rs11206510 1 PCSK9

rs6725887 2 WDR12

rs9818870 3 MRAS

rs12190287 6 TCF21

rs3798220 6 SLC22A3-LPAL2-LPA

rs11556924 7 ZC3HC1

rs1333049 9 CDKN2BAS1

rs579459 9 ABO

rs12413409 10 CYP17A1-CNNM2-NT5C2

rs2505083 10 KIAA1462

rs974819 11 PDGFD

rs3184504 12 SH2B3

rs4773144 13 COL4A1-COL4A2

rs2895811 14 HHIPL1

rs12936587 17 RAI1-PEMT-RASD1

rs1122608 19 LDLR

rs9982601 21 Gene desert (KCNE2)

rs17114036 1 PPAP2B

rs17609940 6 ANKS1A

rs12526453 6 PHACTR1

rs501120 10 CXCL12

rs1412444 10 LIPA

rs46522 17 UBE2Z

rs216172 17 SMG6

rs2075650* 19 ApoE-ApoC1

rs445925* 19 ApoE-ApoC1

rs17464857 1 MIA3

rs12539895 7 7q22

rs9326246 11 ZNF259-APOA5-APOA1

rs7173743 15 ADAMTS7

*not in high LD

2.2 Methodology in the post-GWAS era

2.2.1 Are further discoveries possible with existing GWAS data?

We have already described the “missing heritability” problem in GWAS.

Although it is tempting to abandon common variants altogether and look for the missing heritability elsewhere, several lines of evidence suggest that there are still discoveries to be made in existing GWAS data. By looking at the quantile-quantile plot from almost any given large GWAS, it is clear that the p-value distribution has many more small p-values than expected by chance, and that the typical Bonferroni threshold used to declare statistical significance results in a large number of false negatives. A quantile-quantile plot for a typical consortium-based GWAS is shown in Figure 4. By convention, these plots are displayed on the -log10 scale. Clearly, the observed p-value distribution departs from the null distribution well before the typical GWAS significance threshold of p-value < 5x10^-8 is reached. This is strong empirical evidence that there are many false negatives in GWAS when a standard analytical pipeline is used. In many ways this is not surprising, since the typical GWAS analysis is highly conservative, underpowered and done in a hypothesis-free manner (i.e. SNPs are treated as exchangeable). The question arises: Can we do better? Is there anything in statistics or biology that can help us to get more out of the existing data? The short answer is yes: by using more advanced statistical methods, particularly those that incorporate additional biological knowledge, it is possible to make new discoveries in the GWAS data we already have available.
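The construction of a quantile-quantile plot on the -log10 scale can be sketched in a few lines of Python. This is only an illustration and not the code used to produce Figure 4; the function name `qq_points` and the plotting position (i - 0.5)/m for the expected quantiles are assumptions made for this example.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values for a QQ plot.

    Under the global null, the i-th smallest of m p-values is expected to lie
    near (i - 0.5) / m, which is one common plotting convention (assumed here).
    All p-values must be strictly positive.
    """
    m = len(p_values)
    points = []
    for i, p in enumerate(sorted(p_values), start=1):
        expected = (i - 0.5) / m
        points.append((-math.log10(expected), -math.log10(p)))
    return points

# Toy data: one p-value is far smaller than expected among five tests.
for expected_q, observed_q in qq_points([0.8, 0.35, 0.6, 1e-6, 0.1]):
    print(f"expected {expected_q:.2f}   observed {observed_q:.2f}")
```

Points lying above the diagonal (observed larger than expected) correspond to the excess of small p-values discussed above.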

So what type of biological knowledge can be used to aid in the hunt for disease-causing genes? Prior to the GWAS era (pre-2007), it was common to focus the hunt for disease variants on tens or hundreds of known protein-coding genes. This so-called “candidate gene approach”, where genes are selected according to a priori knowledge of the gene’s biological function, has not been a useful way to identify genetic variation associated with diseases or traits. By focusing on biologically relevant genes, the burden of multiple testing is reduced. Yet even with less stringent significance thresholds, the candidate gene approach was not very successful at identifying trait-associated loci. It turns out that, even if we know which genes are important for disease etiology, this is not equivalent to knowing where important genetic variation lies. Again, this is not entirely surprising, since population genetics tells us that the more important a gene is for survival and function, the less natural variation we expect to see in the gene. Other reasons for the failure of the candidate gene approach may be that we know altogether too little about which genes may play a role in disease, too little about regulatory or other non-coding genetic variants, or maybe we simply know too little about important common variation in the genome in general.

In the last few years, our knowledge about the structure of the genome, and particularly about regulatory elements, has exploded. Our understanding of the human genome has moved long past the "central dogma of molecular biology" that says that a gene is a piece of DNA that codes for a piece of messenger RNA (mRNA) that in turn codes for a protein. Although the central dogma is a good description of bacterial genomics, it is not an adequate description of how the human genome works. We now know that a large part of the important genetic variation lies outside of the tiny bits of the genome that code for proteins, and is instead involved in gene regulation. The ENCODE (Encyclopedia of DNA Elements) Consortium aims to identify all functional elements in the human genome and maintains a comprehensive webpage and database (https://www.encodeproject.org). A better understanding of the regulatory elements of the genome has largely been driven by new technology and clever application of that technology. Paired with bioinformatic tools (i.e. methods and software tools for understanding biological data), so-called “-omics” studies have led to several major genome-level insights about how gene regulation works.

The second-wave analysis of GWAS, characterized by improved use of bioinformatics, statistics and genetic knowledge, is still in its infancy but has led to the development of exciting and promising approaches for the discovery of disease-associated genetic variants. A detailed review of all of these new -omics technologies and related methodology is beyond the scope of this thesis, so we will instead focus on examples of how particular new insights into how human genetics works have been incorporated into GWAS analysis.

2.2.2 Example 1: Expression quantitative trait loci (eQTL) and GWAS

Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are more likely to be expression quantitative trait loci (eQTLs) than would be expected by chance alone. Around 2007, researchers (Stranger et al., 2007) began innovative genome-level studies in humans to find genetic variants (usually SNPs) that associate with variation in gene expression (usually at the mRNA level), termed eQTL experiments (see early review in Gilad et al., 2008). Here the goal is to identify SNPs that exhibit genotype-dependent gene expression (mRNA), with the focus usually being on nearby protein-coding genes. This focus is in part a way to reduce the burden of multiple testing: genetic variants can also influence the expression of distant genes, such as genes on other chromosomes, but testing every SNP against every gene greatly increases the number of tests. Nearby eQTLs are called cis-eQTLs and distant eQTLs are called trans-eQTLs. Since gene expression is tissue dependent, eQTLs are specific to a given tissue type, for instance adipose tissue or blood. The basic idea for any cis-eQTL analysis involves first defining “nearby” (e.g. limiting the association analysis to genes within +/-1000 kb of a given SNP), then calculating an association statistic between the given SNP and the mRNA expression data, one gene at a time, for all “nearby” genes. This results in one p-value for each pair of SNP and nearby gene.
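The two steps of a cis-eQTL scan (define “nearby”, then test each SNP-gene pair) can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of any cited study: the data layout, the function names and the use of a Pearson correlation with a Fisher z-transform (normal approximation) as the association test are all choices made for this example; real analyses typically use regression models with covariates.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cis_eqtl_scan(snps, genes, genotypes, expression, window_kb=1000):
    """Test each SNP against every gene within +/- window_kb on the same chromosome.

    snps, genes: {identifier: (chromosome, position in bp)}
    genotypes:   {snp_id: [allele count per individual]}
    expression:  {gene_id: [expression level per individual]}
    Returns one (snp_id, gene_id, r, p_value) tuple per SNP-nearby gene pair.
    """
    results = []
    for snp_id, (snp_chr, snp_pos) in snps.items():
        for gene_id, (gene_chr, gene_pos) in genes.items():
            if gene_chr != snp_chr or abs(gene_pos - snp_pos) > window_kb * 1000:
                continue  # gene is not "nearby": no test for this pair
            x, y = genotypes[snp_id], expression[gene_id]
            r = pearson_r(x, y)
            # Fisher z-transform with a normal approximation (requires |r| < 1, n > 3)
            z = math.atanh(r) * math.sqrt(len(x) - 3)
            p_value = 2 * (1 - NormalDist().cdf(abs(z)))
            results.append((snp_id, gene_id, r, p_value))
    return results
```

Only genes on the same chromosome and inside the window generate a test, which is where the reduction in the number of tests comes from.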

It has been shown that eQTLs are enriched among SNPs found by GWAS to be associated with complex diseases and traits (Cookson et al., 2009, Nicolae et al., 2010). As such, eQTLs are one type of biological information that can be used to re-prioritize GWAS findings. For example, Westra et al. (2013) incorporate eQTL information into GWAS using p-value weighting methods. In brief, the GWAS p-values are reweighted using weights based on the eQTL p-value for each SNP. This is just one example of how eQTL information can be incorporated into a GWAS analysis, in this case at the summary statistic level.
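The general idea of p-value weighting can be illustrated with a generic weighted Bonferroni scheme, in which each p-value is divided by a prior weight and the weights are rescaled to average one. The sketch below is not the procedure of Westra et al. (2013); in particular the rule in `eqtl_weights` (a fixed boost for SNPs with a small eQTL p-value) and all numerical values are hypothetical.

```python
def weighted_p_values(gwas_p, weights):
    """Divide each GWAS p-value by its weight, after rescaling weights to mean 1.

    Weights must be positive and should be chosen without looking at the GWAS p-values.
    """
    mean_w = sum(weights) / len(weights)
    scaled = [w / mean_w for w in weights]  # enforce an average weight of 1
    return [min(1.0, p / w) for p, w in zip(gwas_p, scaled)]

def eqtl_weights(eqtl_p, threshold=1e-4, boost=4.0):
    """Hypothetical weighting rule: up-weight SNPs whose eQTL p-value is below threshold."""
    return [boost if p < threshold else 1.0 for p in eqtl_p]

# Toy example with four SNPs; the second SNP is a strong eQTL.
gwas_p = [2e-8, 6e-8, 4e-3, 0.6]
eqtl_p = [0.3, 1e-6, 0.5, 0.2]
adjusted = weighted_p_values(gwas_p, eqtl_weights(eqtl_p))
print(adjusted)  # compare each entry to alpha / m, as in an ordinary Bonferroni test
```

In this toy example the second SNP overtakes the first after weighting, which is the re-prioritization effect described above; SNPs without eQTL support pay for this with slightly larger weighted p-values.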

2.2.3 Example 2: Genome annotation and GWAS

Genome annotation can also be used to improve gene discovery in existing GWAS data. SNPs can be annotated to different genomic regions such as regulatory elements, coding genic elements, introns, and intergenic regions. The annotation is based not only on the exact physical location of a given SNP but also on LD with the SNP and the different genomic elements. Schork et al. (2013) show that certain genomic elements are enriched for small p-values in GWAS, indicating that genomic annotation is useful for breaking the exchangeability assumption of the standard GWAS pipeline. This assumption is broken when SNPs come from pre-determinable

categories or clusters, within which they can be dependent and share distributions of effects. Using genome annotation as informative prior information means that a posteriori SNPs are no longer exchangeable and are no longer identically-distributed.

The suggestion of Schork et al. is to incorporate the annotation information in a conditional false discovery rate setting (see Section 2.4.3.1 for more on the conditional false discovery rate).

2.2.4 Example 3: Multiple traits, pleiotropy and GWAS

The idea that one gene can influence more than one phenotype is well established in genetics (pleiotropy). The phenotypes may be obviously connected (e.g. high-density lipoprotein and low-density lipoprotein) or less obviously connected (e.g. the sickle cell anemia gene leads both to changes in red blood cell morphology and to improved resistance to malaria). When pleiotropy is also highly polygenic (i.e. there are many genes affecting both phenotypes), it should be detectable at the genome level. Andreassen et al. (2013) use stratified quantile-quantile plots (Figure 5) to visualize this. When the GWAS p-values for a first trait are stratified by their significance in a second related trait, the leftward deflection on the plot increases as the significance threshold in the second trait becomes more stringent. This indicates that on the genomic scale, the p-values in the second trait are informative about significance in the first trait, which is evidence of polygenic pleiotropy.

This implies that the p-values from trait 2 can be useful prior information to incorporate into the analysis to discover SNPs associated with trait 1. Using conditional false discovery rate (Section 2.4.3.1), the exchangeability assumption implicit in standard GWAS is broken, and the analysis favors those SNPs that are associated with both trait 1 and 2. The methods of Andreassen et al. (2013) are used in Paper 2 and the related polygenic-pleiotropy informed methods of Zablocki et al. (2014) are used in Paper 3.

2.3 Sample overlap in cross-trait analysis of GWAS

Analysis of GWAS data in the post-GWAS era often requires the integration of GWAS data for related traits, usually at the summary statistic level. There are several potential advantages when working with cross-trait GWAS, including increasing the power and sophistication of the statistical methodology, and the possibility to ask more sophisticated biological questions. Since summary statistics do not contain any sensitive information, it is now common practice for GWAS consortia to release their summary statistics for public download from their homepages. Summary statistics are efficient to work with compared to genotype-phenotype data, and when a sufficient statistic is used, they contain all information needed for further inference.

When the GWAS sample for a first trait overlaps with the GWAS sample for a second trait, the test statistics for a given SNP will be spuriously correlated, even when genotype is independent of both phenotypes. Lin and Sullivan (2009) were the first to address the methodological challenge of integrating GWAS with overlapping subjects. Using the correlation between the maximum likelihood estimates of the regression coefficients for a given SNP g, the correlation due to overlap for two case-control studies is:

$$\mathrm{corr}(\hat{\beta}_1, \hat{\beta}_2) \approx \frac{1}{\sqrt{n_1 n_2}} \left\{ n_{c0} \exp\left(\frac{\alpha_1 + \alpha_2}{2}\right) + n_{c1} \exp\left(-\frac{\alpha_1 + \alpha_2}{2}\right) \right\}$$ [2.3.1]

where exp(α1 + α2) ≈ (n11 n21)/(n10 n20), n1 is the sample size of study 1 and n2 the sample size of study 2, n11 and n21 denote the number of cases in study 1 and 2 respectively, n10 and n20 the number of controls in study 1 and 2 respectively, and nc0 and nc1 the overlap in controls and in cases respectively. To calculate this, one needs only the summary statistics and the numbers of overlapping and non-overlapping subjects, which in practice can often be determined from the original GWAS publications. In Paper 3, we use the approach of Lin and Sullivan to provide analogous formulas for the correlation due to overlap for all possible pairings of GWAS studies.
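As a worked illustration of the Lin and Sullivan approximation for two case-control studies, the sketch below evaluates corr ≈ {nc0 exp((α1+α2)/2) + nc1 exp(-(α1+α2)/2)} / sqrt(n1 n2), with exp(α1+α2) replaced by (n11 n21)/(n10 n20). The function name and argument order are assumptions made for this example; it is not the code used in Paper 3.

```python
import math

def overlap_correlation(n11, n10, n21, n20, nc1, nc0):
    """Approximate correlation between the two studies' effect estimates due to shared subjects.

    n11, n10: number of cases and controls in study 1
    n21, n20: number of cases and controls in study 2
    nc1, nc0: number of cases and controls shared by both studies
    """
    n1, n2 = n11 + n10, n21 + n20
    # exp((alpha1 + alpha2) / 2), using exp(alpha1 + alpha2) ~ (n11 * n21) / (n10 * n20)
    ratio = math.sqrt((n11 * n21) / (n10 * n20))
    return (nc0 * ratio + nc1 / ratio) / math.sqrt(n1 * n2)

# Two studies of 1000 cases and 1000 controls each that share all of their controls.
print(overlap_correlation(1000, 1000, 1000, 1000, nc1=0, nc0=1000))
```

Two sanity checks follow directly from the formula: with no shared subjects the correlation is 0, and when the two studies consist of exactly the same cases and controls it is 1.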

In some situations, it is not possible to determine the actual number of overlapping subjects. In this case the GWAS summary statistics for the two traits can be used to estimate the correlation. If all SNPs in both GWAS were null (i.e. truly independent from both phenotypes), and if the samples were non-overlapping, the correlation of the summary statistics would be approximately 0. If all SNPs in both GWAS were null but samples overlapped, the correlation of summary statistics would give an estimate of the spurious correlation due to overlap. But in reality, GWAS summary statistics contain both null and non-null SNPs (i.e. those with a true genetic effect on one or both phenotypes). Thus, correlation of the summary statistics would include both the effect of the overlapping subjects and the effect of the non-null SNPs, which may be truly correlated (i.e. pleiotropy). To date there are two proposed methods for estimating the correlation due to sample overlap from GWAS summary statistics.

Province and Borecki (2013) propose using the tetrachoric correlation of a binary transformation of summary statistics to estimate the correlation due to overlap.

In their proposed method they categorize the GWAS summary statistics for each study (z-scores) as z<0 and z>0. They then calculate the tetrachoric correlation of the resulting categorized vectors. They argue and show with simulation studies that this protects against the influence of the non-null SNPs in estimating the correlation due to sample overlap.
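The idea of using only the signs of the z-scores can be illustrated with a closed-form special case. If the pair of z-scores is assumed bivariate normal with zero means, the probability that both have the same sign is 1/2 + arcsin(ρ)/π (Sheppard's formula), which can be inverted to estimate ρ. The sketch below uses this shortcut; it is an assumption-laden simplification and not the estimator of Province and Borecki (2013), since a general tetrachoric correlation also estimates the two thresholds from the margins of the 2x2 table.

```python
import math

def sign_based_correlation(z1, z2):
    """Estimate the correlation of two z-score vectors from sign concordance only.

    Assumes zero-mean bivariate normal z-scores, so that
    Pr(same sign) = 1/2 + arcsin(rho) / pi.
    SNPs with a z-score of exactly 0 in either study are dropped.
    """
    pairs = [(a, b) for a, b in zip(z1, z2) if a != 0 and b != 0]
    concordant = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    proportion = concordant / len(pairs)
    return math.sin(math.pi * (proportion - 0.5))
```

Because only the sign of each z-score enters the estimate, a small number of very large non-null z-scores cannot dominate it, which is the intuition behind the robustness argument above.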

Zhu et al. (2015) also derive a formula for estimating the correlation due to sample overlap based on summary statistics. First, prune the GWAS summary statistics down to an independent set of SNPs (based on known LD structure in the data). Second, calculate:

$$\mathrm{corr}(T_1, T_2) = \frac{\sum_g (T_{g1} - \bar{T}_1)(T_{g2} - \bar{T}_2)}{\sqrt{\sum_g (T_{g1} - \bar{T}_1)^2 \, \sum_g (T_{g2} - \bar{T}_2)^2}}$$ [2.3.2]

where Tg1 and Tg2 are the test statistics for SNP g for traits 1 and 2 in their corresponding cohorts, and T̄1 and T̄2 are their corresponding means. Their method assumes that all correlation in the test statistics can be attributed to either overlapping or related samples in the two studies. Therefore this method for estimating correlation due to sample overlap will not work well if there is also polygenic pleiotropy.
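The second step is an ordinary Pearson correlation computed over the pruned SNPs. A minimal sketch, assuming the LD pruning has already been carried out and that `t1` and `t2` (names chosen for this example) hold the test statistics of the same SNPs in the same order:

```python
import math

def summary_stat_correlation(t1, t2):
    """Pearson correlation of two vectors of test statistics, one entry per pruned SNP."""
    m = len(t1)
    mean1, mean2 = sum(t1) / m, sum(t2) / m
    cross = sum((a - mean1) * (b - mean2) for a, b in zip(t1, t2))
    ss1 = sum((a - mean1) ** 2 for a in t1)
    ss2 = sum((b - mean2) ** 2 for b in t2)
    return cross / math.sqrt(ss1 * ss2)
```

As noted above, a non-zero value reflects sample overlap only if no other source of correlation, such as polygenic pleiotropy, is present.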

2.4 A primer to false discovery rate methodology

2.4.1 Benjamini-Hochberg false discovery rate

As discussed in Section 2.1.3, correction for multiple testing is an essential part of GWAS analysis. Although the historical “gold standard” correction is a Bonferroni-based cut-off of 5x10^-8, this is indisputably an overly conservative approach (for example, see the empirical evidence of this in the typical GWAS quantile-quantile plot in Figure 4). A viable and more liberal alternative is to control the false discovery rate (FDR) instead. As mentioned in Section 2.1.3, the FDR was introduced in the landmark paper ‘Controlling the false discovery rate: a practical and powerful approach to multiple testing’ by Benjamini and Hochberg (1995).

Interestingly, further development of FDR-based methodology has been largely inspired by problems arising in genomics research, where studies involving gene expression microarray experiments were the first application to present with multiple testing challenges on such an enormous scale (Benjamini, 2010).

To control the expected proportion of false discoveries, i.e. the expected ratio E(V/R) of the number of false positives V to the number of significant tests R, Benjamini and Hochberg introduced a step-up procedure that is guaranteed (for independent test statistics) to control E(V/R) at a level no greater than q, the desired FDR. We revisit this procedure, first introduced in Section 2.1.3, with the addition of more formal notation. First, order the m p-values from smallest to largest, p(1) ≤ ... ≤ p(m), and assign a corresponding rank i to each p-value.

Compare each individual p-value to its Benjamini-Hochberg critical value, (i/m)*q.

Define k = max(i : p(i) ≤ (i/m)*q); all hypotheses belonging to p(1),...,p(k) are rejected. Thus the largest p-value that does not exceed its critical value is significant, and all smaller p-values are significant as well, even those that exceed their own critical values.
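The step-up procedure can be written compactly in Python. The following is a minimal sketch with a function name chosen for this example, not a replacement for established implementations in statistical software.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])  # indices, smallest p first
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank  # remember the largest rank that meets its critical value
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True  # reject everything up to and including rank k
    return rejected

print(benjamini_hochberg([0.9, 0.035, 0.02, 0.03], q=0.05))
```

In the printed example the critical values are 0.0125, 0.025, 0.0375 and 0.05. The two smallest p-values (0.02 and 0.03) exceed their own critical values, but the third (0.035) does not, so k = 3 and all three are rejected. This is what distinguishes the step-up rule from simply stopping at the first failure.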

2.4.2 The Bayesian approach to the false discovery rate

The FDR has subsequently been approached from a Bayesian perspective (see Storey, 2002, Efron et al., 2001, Efron and Tibshirani, 2002, Efron, 2008).

Fundamental to the Bayesian approach is the two-group model, where each of the m tests is either null or non-null with prior probability π0 or π1 = 1 - π0 respectively.

The p-value, p1g, or more generally the test statistic for SNP g, z1g, has a different distribution depending on whether it is null or non-null. In the following we drop the g subscript for simplicity. Let f0(z1) and f1(z1) denote the null and non-null density functions respectively, and let F0(z1) and F1(z1) denote the corresponding cumulative distribution functions. As such, z1 follows a two-group mixture model with cumulative distribution function:

F(z1) = π0F0(z1)+ π1F1(z1) [2.4.2.1]

and density function

f(z1) = π0f0(z1)+ π1f1(z1). [2.4.2.2]

From here we can use Bayes' theorem and define the tail area-based FDR (Fdr), taking F0 and F as the tail areas Pr(Z ≥ z1) under the null and under the mixture respectively, as

Fdr(z1) = Pr(null | Z ≥ z1) = π0F0(z1)/ F(z1) [2.4.2.3]

and the local FDR (fdr) as

fdr(z1) = Pr(null | Z = z1) = π0f0(z1)/ f(z1). [2.4.2.4]

Fdr is very much like a corrected p-value and connects very closely to the Benjamini and Hochberg FDR (Efron, 2008). In order to estimate Fdr or fdr, one proceeds by fitting the mixture model in either Equation 2.4.2.1 or 2.4.2.2 to the observed data. This can be done using either a theoretical null model (e.g. standard normal distribution for z-scores or uniform(0,1) distribution for p-values), or an empirical null model (e.g. specifying a distribution type but estimating the parameters from the data).

Additionally, an estimate of f(z1) or F(z1) is required, as is an estimate of π0 (which in GWAS can reasonably and conservatively be set to 1). F(z1) can be estimated by the empirical cumulative distribution function mp/m, where mp is the number of tests with a z-score greater than or equal to z1 and m is the total number of tests. GWAS data is particularly well-suited to Fdr or fdr estimation since m is very large (on the order of 10⁶) and π0 is well approximated by 1 (that is, only a few dozen to a few hundred common variants out of approximately one million are expected to be non-null).
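A minimal sketch of this estimate for z-scores is given below, using a theoretical standard normal null, π0 = 1 and upper tail areas (matching the definition Pr(null | Z ≥ z1) and the count mp). The function name is an assumption made for this example, and the quadratic-time counting is chosen for readability; for GWAS-sized data one would sort the z-scores once instead.

```python
from statistics import NormalDist

def tail_area_fdr(z_scores, pi0=1.0):
    """Estimate Fdr(z) = pi0 * F0(z) / F(z) for every z-score, using upper tails.

    F0 is the upper-tail probability under the theoretical N(0, 1) null and
    F is the empirical proportion m_p / m of tests with a z-score >= z.
    """
    m = len(z_scores)
    null = NormalDist()  # theoretical null: standard normal
    estimates = []
    for z in z_scores:
        m_p = sum(1 for other in z_scores if other >= z)  # at least 1 (the test itself)
        f0 = 1.0 - null.cdf(z)
        estimates.append(min(1.0, pi0 * f0 / (m_p / m)))  # cap the estimate at 1
    return estimates
```

Setting `pi0` to 1 makes the estimate conservative, as described above.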

2.4.3 Bivariate extensions of the false discovery rate

The FDR methods described above implicitly assume the exchangeability of SNPs. Breaking this assumption and incorporating prior information on each SNP will improve power, as long as this prior information is truly a useful covariate. The prior information could be different kinds of annotation (see Sections 2.2.2 to 2.2.4 for examples), but in this thesis we focus on pleiotropy. The basic idea is that in the presence of polygenic pleiotropy, the GWAS summary statistic of a second trait (z2) can be informative for FDR modeling for the first trait (z1). Papers 2 and 3 in this thesis use FDR methodology involving bivariate extensions to Equations 2.4.2.3 and 2.4.2.4. Paper 2 uses the conditional FDR (condFdr) and the related conjunctional FDR (conjFdr), which are extensions of the Fdr, with the estimation procedures described in Andreassen et al. (2013). Paper 3 uses the covariate-modulated FDR (cmfdr), an extension of the fdr proposed by Ferkingstad et al. (2008), with the estimation procedures first described in Zablocki et al. (2014).

Conceptually, a full mixture model for two traits is a four-group mixture model, given by the following density function:

f(z1, z2) = π0f0(z1, z2) + π1f1(z1, z2) + π2f2(z1, z2) + π3f3(z1, z2) [2.4.3.1]

where π0 is the proportion of SNPs for which both phenotypes are null, π1 is the proportion of SNPs where both phenotype 1 and 2 are non-null (i.e. the pleiotropic SNPs), π2 is the proportion of SNPs where phenotype 1 is null and phenotype 2 is non-null, and π3 is the proportion of SNPs where phenotype 2 is null and phenotype 1 is non-null. Likewise, f0(z1, z2) is the density function for the SNPs where both phenotypes are null, f1(z1, z2) is the density function for the SNPs where both phenotypes are non-null, f2(z1, z2) is the density function for the SNPs where phenotype 1 is null and phenotype 2 is non-null, and f3(z1, z2) is the density function for the SNPs where phenotype 2 is null and phenotype 1 is non-null. This full specification of the four-group mixture model is a useful starting point since it classifies SNPs into four biologically interpretable categories, which may be useful for future inference. In practice, a simplified mixture model is usually assumed for estimation procedures in bivariate extensions of the local false discovery rate. If we imagine that all non-null SNPs are non-null for both trait 1 and trait 2 (i.e., π2 and π3 are 0), the mixture model in Equation 2.4.3.1 simplifies to:

f(z1, z2) = π0f0(z1, z2) + π1f1(z1, z2). [2.4.3.2]

2.4.3.1 Conditional false discovery rate

The condFdr is defined, using Bayes Theorem, as:

condFdr(z1 | z2) = Pr(null for trait 1 | Z1≥ z1 and Z2 ≥ z2)

= π0 (z2)F0(z1 | z2) / F(z1 | z2) [2.4.3.1.1]

or on the p-value scale,

condFdr(p1 | p2) = Pr(null for trait 1 | P1 ≤ p1 and P2 ≤ p2)

= π0 (p2)F0(p1 | p2) / F(p1 | p2) [2.4.3.1.2]

Under the null hypothesis p1 and p2 are independent so F0(p1 | p2) = F0(p1) = p1. This can be thought of as the expected quantile of p1 under the null hypothesis.

Therefore

condFdr(p1 | p2) = π0 (p2)p1 / F(p1 | p2). [2.4.3.1.3]

Conservatively, π0 (p2) is set to 1. The conditional cumulative distribution function, F(p1 | p2), needs to be estimated from the data. This can be thought of as the observed quantile of p1 conditioned on the p-value in the second trait being as small as or smaller than the observed p-value, p2. The approach taken here is described in detail in Andreassen et al. (2013). In brief, SNPs are binned into a “look-up table”, with the p-value for the first trait in the rows and the p-value for the second trait in the columns. From this table, the observed quantile of p1 amongst the subset of SNPs for which the p-value for the second trait is as small as or smaller than p2 is calculated.
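The estimate in Equation 2.4.3.1.3 can be illustrated with a brute-force sketch that computes the conditional empirical quantile directly for each SNP rather than through a binned look-up table. This is an assumption-based simplification for small examples, not the implementation of Andreassen et al. (2013); the function name is hypothetical and the run time grows quadratically with the number of SNPs.

```python
def cond_fdr(p1, p2, pi0=1.0):
    """Empirical conditional Fdr for each SNP: pi0 * p1 / F(p1 | p2).

    For SNP g, F(p1 | p2) is estimated as the proportion of SNPs with
    P1 <= p1[g] among the SNPs with P2 <= p2[g].
    """
    m = len(p1)
    estimates = []
    for g in range(m):
        subset = [i for i in range(m) if p2[i] <= p2[g]]   # conditioning set, contains g
        hits = sum(1 for i in subset if p1[i] <= p1[g])    # counts g itself, so hits >= 1
        estimates.append(min(1.0, pi0 * p1[g] / (hits / len(subset))))
    return estimates

# Toy example: four SNPs with p-values for trait 1 and trait 2.
print(cond_fdr([0.01, 0.5, 0.02, 0.9], [0.9, 0.001, 0.8, 0.5]))
```

With real GWAS data the conditioning subsets contain many SNPs, and the estimate falls below the unconditional Fdr for SNPs whose trait 1 p-values are enriched among SNPs with small trait 2 p-values.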

2.4.3.2 Covariate modulated local false discovery rate

The local false discovery rate has also been extended to include information from a second variable. This extension was first proposed by Ferkingstad et al. (2008) and further developed by Zablocki et al. (2014). In Paper 3, we use the estimation procedures of Zablocki et al. (2014).

The cmfdr is defined, using Bayes Theorem, as:

cmfdr(z1 | z2) = Pr(null for trait 1 | Z1= z1 and Z2 = z2)

= π0 (z2)f0(z1) / f(z1 | z2)

= π0 (z2)f0(z1) /{ π0 (z2)f0(z1) + π1(z2)f1(z1 | z2)}. [2.4.3.2.1]

Here it is required to estimate three quantities: the proportion of SNPs that are null for trait 1 given that Z2 = z2; the parameters of the null density function for z1, which is assumed independent of z2; and the non-null density function for z1 given that Z2 = z2. A fully Bayesian estimation procedure is followed, where f0(z1) follows a folded normal distribution with mean 0 and f1(z1 | z2) follows a gamma distribution. Here the shape parameter is modeled as dependent on z2, but the rate parameter is assumed independent of z2. The proportion of non-null SNPs for trait 1 is dependent on z2, and is modeled using a logistic regression procedure. The implementation of this procedure in R is available from the authors at:

https://sites.google.com/site/covmodfdr/.
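To make the structure of Equation 2.4.3.2.1 concrete, the sketch below evaluates the cmfdr for a single SNP once parameter values are given. It performs no estimation and is not the authors' R implementation: the parameter names, the log link between z2 and the gamma shape, and the exact form of the logistic model are all assumptions introduced for this illustration, and in practice the parameters would come from the Bayesian fitting procedure.

```python
import math

def cmfdr(z1, z2, sigma0, gamma0, gamma1, a0, a1, rate):
    """Evaluate pi0(z2) f0(z1) / {pi0(z2) f0(z1) + pi1(z2) f1(z1 | z2)} for one SNP.

    z1: absolute z-score for trait 1 (must be > 0); z2: covariate from trait 2.
    All remaining arguments are hypothetical, already-estimated parameters.
    """
    # Proportion of non-null SNPs given z2: logistic regression on the covariate
    pi1 = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * z2)))
    pi0 = 1.0 - pi1
    # Null density: folded normal with mean 0 and scale sigma0, independent of z2
    f0 = math.sqrt(2.0 / math.pi) / sigma0 * math.exp(-z1 ** 2 / (2.0 * sigma0 ** 2))
    # Non-null density: gamma with shape depending on z2 (log link assumed), fixed rate
    shape = math.exp(a0 + a1 * z2)
    f1 = rate ** shape / math.gamma(shape) * z1 ** (shape - 1.0) * math.exp(-rate * z1)
    return pi0 * f0 / (pi0 * f0 + pi1 * f1)

# Same trait 1 z-score, increasing covariate value (illustrative parameter values only).
for covariate in (0.0, 2.0):
    print(covariate, cmfdr(3.0, covariate, 1.0, -3.0, 1.0, 1.0, 0.0, 1.0))
```

With these illustrative values, a larger covariate raises the prior probability of being non-null and therefore lowers the cmfdr for the same trait 1 z-score, which is how information from the second trait is brought into the analysis.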

3 Aims

This thesis aims to apply and improve analyses of GWAS data in order to find trait-associated genetic variants, specifically using a standard GWAS pipeline for the genotype-phenotype data from the TOP study for multiple neurocognitive traits (Paper 1), and using pleiotropy-informed false discovery rate methodology for summary statistic data from the CARDIoGRAMplusC4D Consortium for CAD and from related cardio-metabolic traits (Paper 2). We also aim to propose a method to adjust for sample overlap in cross-trait analysis of GWAS data when only summary statistics are available (Paper 3).

4 Summary of papers in this thesis

4.1 Paper 1

LeBlanc, M., Kulle, B., Sundet, K., Agartz, I., Melle, I., Djurovic, S., Frigessi, A. and Andreassen, O.A., 2012. Genome-wide study identifies PTPRO and WDR72 and FOXQ1-SUMO1P1 interaction associated with neurocognitive function. Journal of psychiatric research, 46(2), pp.271-278.

The aim of this paper was to find SNPs/genes associated with neurocognitive function using the standard GWAS approach. Samples were from the Thematically Organized Psychosis (TOP) Study conducted at Oslo University Hospital. The sample included healthy individuals (n = 377) and patients with schizophrenia spectrum disorders (n = 204) and bipolar disorders (n = 177) having genotype (Affymetrix Genome-Wide Human SNP Array 6.0) and neurocognitive data available. Twenty-four neurocognitive tests falling into five clinical domains (Attention, Executive Functioning, Psychomotor Speed, Learning and Memory, Intelligence) were explored as outcome variables using a standard GWAS approach. Two independent associations achieved genome-wide significance based on Bonferroni correction and these were annotated to the PTPRO and WDR72 genes. Additionally, we looked for interaction in the subset of SNPs with p-value < 3.6 × 10⁻⁷, corresponding to an overall α of 0.2, and found a significant FOXQ1-SUMO1P1 interaction. The findings should be replicated in independent samples, but indicate a role of PTPRO in Learning and Memory, of WDR72 in Executive Functioning, and of an interaction between FOXQ1 and SUMO1P1 in Psychomotor Speed.

4.2 Paper 2

LeBlanc, M., Zuber, V., Andreassen, B.K., Witoelar, A., Zeng, L., Bettella, F., Wang, Y., McEvoy, L.K., Thompson, W.K., Schork, A.J., Reppe, S., Barrett-Connor, E., Ligthart, S., Dehghan, A., Gautvik, K.M., Nelson, C.P., Schunkert, H., Samani, N.J., CARDIoGRAM Consortium, Ridker, P.M., Chasman, D.I., Aukrust, P., Djurovic, S., Frigessi, A., Desikan, R.S., Dale, A.M. and Andreassen, O.A., 2016. Identifying Novel Gene Variants in Coronary Artery Disease and Shared Genes with Several Cardiovascular Risk Factors. Circulation Research, 118(1):83-94.

The main aim of this paper was to find SNPs/genes associated with coronary artery disease (CAD) using a post-GWAS era approach. Here we used the summary statistics from a large-scale genomic study conducted by the CARDIoGRAMplusC4D Consortium together with GWAS summary statistics from eight related cardiovascular risk factors to improve gene discovery for CAD. The eight risk factors were: type 1 diabetes, type 2 diabetes, high-density lipoprotein, low-density lipoprotein, triglycerides, C-reactive protein, body mass index and systolic blood