
1.4 Association studies

1.4.2 eQTL mapping

Expression quantitative trait locus (eQTL) mapping is related to traditional GWAS, as just described, but the phenotype is of a more abstract kind, namely gene expression. The problems of GWAS grow for this type of association study because we have not only a large number of genetic variants but also a large number of phenotypes: one measure of gene expression for every transcribed gene. If we consider the expression of every gene in the human genome together with all the genetic variants in the genome, we have to perform approximately 30 billion tests. Not only does this result in a multiple testing problem, it also causes purely computational problems. Not too long ago, this many tests would have been practically impossible to perform due to the computational resources needed, but with the increase in computational power, coupled with clever methods [61], this is now relatively easy to do.

QTL mapping is typically divided into two categories: linkage mapping and association mapping. Linkage mapping is usually used when family information is available, such as in a controlled cross. It relies on known markers and operates by performing a cross and observing how genetic markers associate with changes in the trait of interest. For the work described in this thesis we have instead used natural populations of plants, for which we do not have family information and where a naturally breeding collection of individuals is considered rather than a controlled cross between two individuals. This approach is referred to as association mapping, or linkage disequilibrium mapping. The method is related to GWAS in that a large number of genetic markers (typically SNPs) are statistically tested to determine whether they are significantly associated with variation in a phenotype, the phenotype in this case being gene expression levels. Linkage disequilibrium (LD) is the non-random association between alleles at different loci. The idea is that the SNPs used for the association are in LD with the factor that is actually responsible for the phenotype. This way, the causal variant itself does not have to be included in the association, as long as a variant in LD with it is included.
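As a rough illustration of what "in LD" means in practice, the sketch below measures LD between two SNPs as the squared Pearson correlation of their genotype dosages (0, 1 or 2 copies of the alternative allele). This is a common approximation for unphased genotypes, not necessarily the estimator used in the papers, and the function name and the genotype vectors are hypothetical.

```python
from math import fsum


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two SNP dosage vectors (0/1/2)."""
    if len(dosage_a) != len(dosage_b) or len(dosage_a) < 2:
        raise ValueError("need two equally long vectors with at least two individuals")
    n = len(dosage_a)
    mean_a = fsum(dosage_a) / n
    mean_b = fsum(dosage_b) / n
    cov = fsum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = fsum((a - mean_a) ** 2 for a in dosage_a)
    var_b = fsum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("LD is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)


# Hypothetical genotypes for eight individuals (illustration only).
causal = [0, 1, 2, 0, 1, 2, 0, 1]    # the variant that actually drives the trait
tag_snp = [0, 1, 2, 0, 1, 2, 0, 2]   # genotyped marker, differs in one individual
unlinked = [2, 0, 1, 1, 2, 0, 0, 1]  # marker with no relationship to the causal variant

print(f"r2(causal, tag_snp)  = {ld_r2(causal, tag_snp):.3f}")
print(f"r2(causal, unlinked) = {ld_r2(causal, unlinked):.3f}")
```

A marker with a high r2 to the causal variant carries nearly the same genotype information, which is why it can stand in for the causal variant in the association test.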

eQTLs can be classified as either local or distant. A local eQTL is close to the gene that it is associated with while a distant one is far away, either on the same chromosome or on a different chromosome than the associated gene. The distance threshold where local becomes distant is however somewhat arbitrary.

[Figure 6 graphic: regulatory network linking the eQTLs Q1-Q4, the regulators R1-R3 and the genes G1-G3, with a panel of genotype (Q1-Q4) and expression (R3, G1-G3) states.]

Figure 6: Simplified example of when eQTL effects and gene regulation are masked. A green checkmark means the regulatory link is enabled, while a red cross means it is disabled. Green arrows indicate up-regulation of the gene while a red arrow indicates down-regulation of the gene. In the regulatory network, the regulators R1 and R2 are always on, while regulator R3 is on as long as at least one of the eQTLs Q2 or Q3 enables the signal. The expression of G1 only depends on Q1, and this eQTL is thus easily detected by standard eQTL mapping methods since there is a perfect relationship between the genotype and the expression. Due to the dual regulators and eQTLs for R3, there is no perfect relationship between the eQTLs Q2 and Q3 and either R3 or G2. The regulation of G3 is even more complicated: R3 needs to be expressed, and at the same time Q4 must enable the signal. No perfect relationship between G3 and any of the eQTLs exists even though Q4 is cis-acting and Q2 and Q3 are both trans-acting.

In our eQTL analysis in paper III, we classify SNPs within 100 kilobases of the transcription start site as local, based on the distance distribution of eQTLs on the same chromosome as the associated gene. The division into local and distant is a purely structural one, as opposed to a functional definition. A more functional definition also exists, where eQTLs are classified depending on how they act on the associated gene. eQTLs are said to act either in cis or in trans, with cis-eQTLs acting directly on gene expression while trans-eQTLs act indirectly on the associated gene. An example of a cis mechanism could be a variant that modifies a transcription factor binding motif in the promoter of a gene, while a trans effect could be something as subtle as affecting the abundance of a certain co-factor that is required for expression of the associated gene. Consequently, a cis-eQTL should act in an allele-specific manner: if a transcription factor binding site is disrupted in only one allele, only the transcription of that allele will be affected. Conversely, trans-eQTLs will have the same effect on both alleles. Due to the indirect mechanism of trans-eQTLs, these generally have smaller effects (recall the slope in figure 5), which has been reported by numerous studies (references in [62,63]), although there are exceptions [64]. Normally, cis-acting variants are local to the associated gene while trans effects are more distant. Some studies, like [65], opt to consider only local eQTLs. This is to some extent a tactical decision: fewer tests have to be performed, which makes the computational problem a bit easier, and the multiple testing problem becomes slightly less severe since the number of markers considered for each gene is much smaller than the total number of markers.
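The structural local/distant classification can be sketched in a few lines. The example assumes the 100 kb distance from the transcription start site (TSS) quoted for paper III; the function name, the chromosome labels, the coordinates and the choice to treat a SNP at exactly 100 kb as local are hypothetical and only serve the illustration.

```python
LOCAL_WINDOW_BP = 100_000  # 100 kb from the TSS, the threshold quoted for paper III


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_tss, window_bp=LOCAL_WINDOW_BP):
    """Classify a SNP-gene pair as 'local' or 'distant' (purely structural)."""
    if snp_chrom != gene_chrom:
        # A SNP on another chromosome is always distant.
        return "distant"
    # Assumption: the window is inclusive, so exactly window_bp counts as local.
    return "local" if abs(snp_pos - gene_tss) <= window_bp else "distant"


# Hypothetical coordinates (illustration only).
print(classify_eqtl("Chr01", 1_050_000, "Chr01", 1_000_000))  # same chromosome, 50 kb away
print(classify_eqtl("Chr01", 1_500_000, "Chr01", 1_000_000))  # same chromosome, 500 kb away
print(classify_eqtl("Chr02", 1_000_000, "Chr01", 1_000_000))  # different chromosome
```

Restricting an analysis to local eQTLs then amounts to testing, for each gene, only the SNPs for which this function returns "local".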

The first study of the genetics underlying gene expression variation was performed in yeast in 2002 [66] and included 3,312 genetic markers and 6,215 genes. At the time this was a big feat, but today we are able to run association tests for all genes in the genome and all markers, as demonstrated by the human Genotype Tissue Expression project (GTEx; [65]) with a total of about 6.8 million SNPs and using both coding and non-coding genes (53,934 genes in total).

1.4.2.1 Biology gets complicated quickly

Complex traits are the result of the interactions between many different factors. When it comes to eQTLs, the most common approach is to consider pairs of genes and genetic variants one by one. A better approach would be to analyse combinations of genetic variants and how they affect gene expression in concert. However, it is not possible to do this exhaustively due to computational complexity and multiple testing. Figure 6 shows a simple example of how regulation can be hidden from traditional analysis methods.
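To make the one-by-one approach concrete, the sketch below fits a simple linear model of expression on genotype dosage for every gene-SNP pair in a toy data set. It is only a minimal illustration under assumed data: the gene and SNP names and all values are hypothetical, no covariates are included, and p-values are left out since the standard library has no t-distribution.

```python
from math import fsum


def fit_pair(dosage, expression):
    """Least-squares fit of expression = intercept + slope * dosage for one pair."""
    n = len(dosage)
    mean_x = fsum(dosage) / n
    mean_y = fsum(expression) / n
    sxx = fsum((x - mean_x) ** 2 for x in dosage)
    syy = fsum((y - mean_y) ** 2 for y in expression)
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy * sxy / (sxx * syy)  # fraction of expression variance explained
    return slope, intercept, r2


# Hypothetical toy data: six individuals, two SNPs, two genes.
genotypes = {
    "snp_1": [0, 0, 1, 1, 2, 2],
    "snp_2": [1, 0, 2, 0, 1, 2],
}
expression_levels = {
    "gene_A": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],  # tracks snp_1
    "gene_B": [2.0, 2.1, 1.9, 2.2, 2.0, 2.1],  # tracks neither SNP closely
}

# The exhaustive pairwise scan: one model per (gene, SNP) combination.
results = {
    (gene, snp): fit_pair(dosage, levels)
    for gene, levels in expression_levels.items()
    for snp, dosage in genotypes.items()
}
for (gene, snp), (slope, _, r2) in results.items():
    print(f"{gene} ~ {snp}: slope = {slope:+.2f}, r2 = {r2:.2f}")
```

The number of models is the number of genes times the number of SNPs, which is what makes the full scan large even though each individual fit is trivial.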

The gene G1 is perfectly correlated with the genotype of the eQTL Q1, and thus the traditional approach is perfectly capable of detecting this relationship. It does not take much before this becomes too complicated though. R3 is dependent on two eQTLs, Q2 and Q3. The expression of R3 is not perfectly correlated with either Q2 or Q3, but in combination these eQTLs fully explain the expression of R3. In other words, a model that takes all pairs of SNPs into account would be needed to detect this relationship. Since G2 is directly regulated by R3, the dissection of G2 would need the same model as R3. Finally, G3 could only be dissected if all triplets of SNPs were taken into account. This is a very simplified example, but it highlights the inherent difficulties of systems genetics.

In paper III we work with about 3.2 million SNPs and about 20,000 genes, resulting in about 64 billion models. Models of this kind are able to capture the expression of G1. In order to dissect the expression of R3 and G2 we would need to create models using all pairs of SNPs against all genes, and this would result in 1.02 × 10^17 models. The expression of G3 is explained by three eQTLs, and in order to test all SNP triplets, we would have to investigate 1.09 × 10^23 models. Assuming that we are able to calculate 10 million models per second, which is about the speed we achieved in paper III, computing all models for pairs of SNPs would take more than 300 years, and all models of SNP triplets would take more than 340 million years.
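The argument about figure 6 can be checked mechanically. The sketch below encodes the figure as Boolean logic, which is an assumption made for this illustration: each eQTL is either enabling (1) or not (0), all links are treated as up-regulation, and R3 = Q2 OR Q3, G2 = R3, G3 = R3 AND Q4. It then searches for the smallest set of eQTLs that fully determines each expression value. The function names are hypothetical.

```python
from itertools import combinations, product

EQTLS = ("Q1", "Q2", "Q3", "Q4")


def expression(state):
    """Assumed Boolean reading of figure 6 (1 = enabled/expressed, 0 = not)."""
    r3 = state["Q2"] or state["Q3"]  # R3 is on if Q2 or Q3 enables the signal
    return {
        "G1": state["Q1"],           # G1 depends on Q1 only
        "R3": r3,
        "G2": r3,                    # G2 is directly regulated by R3
        "G3": r3 and state["Q4"],    # G3 needs R3 and an enabling Q4
    }


# All 16 combinations of the four eQTL states.
STATES = [dict(zip(EQTLS, bits)) for bits in product((0, 1), repeat=len(EQTLS))]


def determined_by(target, eqtls):
    """True if the target takes one value for each combination of the given eQTLs."""
    seen = {}
    for state in STATES:
        key = tuple(state[q] for q in eqtls)
        value = expression(state)[target]
        if seen.setdefault(key, value) != value:
            return False
    return True


def smallest_model(target):
    """Smallest set of eQTLs that fully explains the target's expression."""
    for size in range(1, len(EQTLS) + 1):
        for subset in combinations(EQTLS, size):
            if determined_by(target, subset):
                return subset
    return None


for target in ("G1", "R3", "G2", "G3"):
    print(target, "needs", smallest_model(target))
```

Under these assumptions G1 is explained by a single eQTL, R3 and G2 need the pair (Q2, Q3), and G3 needs the triplet (Q2, Q3, Q4), which matches the single-SNP, pair and triplet models discussed in the text.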

Moreover, this is not even the worst part, since the enormous number of tests would need a correspondingly strict correction for multiple testing. In order for any effect, no matter how large, to be significant, an enormous number of sequenced and phenotyped individuals is needed. This can be viewed as the catch-22 of genomics, where we have biological complexity on one side and limited data availability and computational power on the other.
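The model counts and running times follow directly from the figures given in the text (about 3.2 million SNPs, about 20,000 genes, 10 million models per second), and the sketch below reproduces them. The Bonferroni threshold is added only to illustrate how strict the correction becomes; the significance level of 0.05 and the use of Bonferroni are assumptions of this example, not a description of the correction used in paper III.

```python
from math import comb

N_SNPS = 3_200_000              # approximate number of SNPs in paper III
N_GENES = 20_000                # approximate number of genes in paper III
MODELS_PER_SECOND = 10_000_000  # throughput quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600
ALPHA = 0.05                    # assumed family-wise significance level


def n_models(snps_per_model):
    """Number of models when every SNP combination is tested against every gene."""
    return comb(N_SNPS, snps_per_model) * N_GENES


def years_needed(models):
    """Running time in years at the quoted throughput."""
    return models / MODELS_PER_SECOND / SECONDS_PER_YEAR


for snps_per_model in (1, 2, 3):
    models = n_models(snps_per_model)
    print(
        f"{snps_per_model} SNP(s) per model: {models:.3g} models, "
        f"{years_needed(models):.3g} years, "
        f"Bonferroni threshold {ALPHA / models:.3g}"
    )
```

With one SNP per model the scan takes hours rather than years, whereas pairs and triplets both exceed any realistic compute budget and push the per-test significance threshold to vanishingly small values.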

Machine learning is a class of methods that can be used to identify patterns in large data sets. In paper II we use a support vector machine (SVM) approach to classify samples as male or female based on gene expression. Omics data have a dimensionality problem, with a large number of variables (e.g. genes) compared to the number of observations. An SVM will very likely perform very well on the data it was trained on, but it will not generalise, i.e. new observations will not be classified with high accuracy. This can be alleviated somewhat by constraining the model using cross-validation, but the model will then likely perform poorly on all data instead. In order to use methods like this, the data must be limited to smaller data sets with a higher signal-to-noise ratio.
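The dimensionality problem can be demonstrated with a small simulation. The sketch below does not use an SVM: to stay within the Python standard library it trains a plain perceptron, used here as a stand-in for a linear classifier. The data are pure noise with random labels, so there is nothing to learn, yet with many more variables than observations the training samples can still be separated perfectly. All sizes and names are hypothetical.

```python
import random

N_FEATURES = 500  # e.g. genes
N_TRAIN = 20      # observations used for training
N_TEST = 200      # new observations


def score(weights, x):
    return sum(w * v for w, v in zip(weights, x))


def train_perceptron(samples, labels, max_epochs=1000):
    """Plain perceptron without bias term; stops once every sample is classified."""
    weights = [0.0] * len(samples[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * score(weights, x) <= 0:
                weights = [w + y * v for w, v in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights


def accuracy(weights, samples, labels):
    hits = sum((1 if score(weights, x) > 0 else -1) == y for x, y in zip(samples, labels))
    return hits / len(labels)


rng = random.Random(1)


def noise_data(n_samples):
    """Random 'expression' values with labels that are independent of them."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)] for _ in range(n_samples)]
    ys = [rng.choice((-1, 1)) for _ in range(n_samples)]
    return xs, ys


train_x, train_y = noise_data(N_TRAIN)
test_x, test_y = noise_data(N_TEST)
weights = train_perceptron(train_x, train_y)

train_accuracy = accuracy(weights, train_x, train_y)
test_accuracy = accuracy(weights, test_x, test_y)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")
```

The training accuracy says nothing about the model here, since the labels carry no signal; only the accuracy on new observations, which is expected to stay close to chance, reflects what was actually learned.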

As seen in figure 6, the complexity of regulation often results in redundancy in the regulatory network, redundancy that can act as a buffer for random mutations [67]. Here gene duplications play a role as well since with two copies of the same gene, any detrimental mutations to one of them will most likely not affect the organism in a drastically negative way. Not only does this protect the organism, but it can also hide the regulatory mechanism from traditional analysis methods. One way to think of this is that simplicity would be bad for biology in general. If something is easy to disentangle, then a very small perturbation, like a mutation, could possibly disrupt the whole system. This is part of why we, in paper III, hypothesise that genes that are central in the co-expression network have evolved more redundancy in their regulation. By having more redundancy, these genes will not be affected as easily by random mutations, and this is the same idea underlying the hypothesis of scale-free biological networks (section 1.3.3).