Paper summaries - Computational challenges in family genetics

The papers in this thesis cover several computational problems that the modern forensic scientist faces or will face. We start by exploring a promising new typing method, high-density SNP microarrays, where more than 900,000 genetic markers are genotyped on small chips. The theme continues in the subsequent papers, where we describe and explore new statistical methods.

The first paper (I) provides a broad motivation for the other papers by demonstrating the need for the research conducted in the following papers (II, III, V and VI). Throughout the articles we illustrate the computational and statistical issues encountered when using closely located genetic markers. Hopefully this thesis and its papers will shed some light on possible approaches to resolve these problems and to take advantage of the information inherent in such data. Whereas paper II explores a more experimental approach using object oriented Bayesian networks, papers III, IV and VI describe general implementations of methods to handle issues such as linkage and linkage disequilibrium.

We further consider other computational problems encountered in forensic genetics, e.g. mass disaster identification, simulations and models for mutations. Paper V provides new developments for the forensic software Familias [35], implementing ideas to resolve the mentioned issues.

Paper I: DNA microarray as a tool to establish genetic relatedness – Current status and future prospects

There is an increasing interest in using a greater number of genetic markers as more distant genetic relationships are investigated [23, 56]. Paper I uses data from two extended families genotyped on a high-density SNP microarray chip from Affymetrix. The chip includes more than 900,000 SNPs, more or less evenly spread throughout the entire genome. Software such as Merlin [73] is commonly used in medical genetics to study linkage in genetic disorders, but has been less used in forensic genetics.

This paper is one of the first to present the application of the software to real high-density marker data in order to calculate likelihoods for extended relationships. Previous studies have investigated the use of large sets of markers on simulated data [55]. Importantly, however, the present paper demonstrates an obvious deviation between real and simulated data. The deviation is explained by the population genetic phenomenon known as linkage disequilibrium, or allelic association. Whereas linkage disequilibrium can be simulated, no good algorithm to handle its implications in the statistical calculations has been proposed, though one approach is presented by Abecasis et al. [103].
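Linkage disequilibrium between two biallelic SNPs is commonly quantified by the coefficient D, the difference between the observed haplotype frequency and the product of the allele frequencies, and by the squared correlation r². The following is a minimal sketch of those two standard measures; the function name, the allele coding and the haplotype frequencies are hypothetical illustrations and are not taken from Paper I.

```python
def linkage_disequilibrium(hap_freqs):
    """Return (D, r2) for two biallelic SNPs.

    hap_freqs maps a haplotype (allele at SNP 1, allele at SNP 2) to its
    population frequency. Alleles are coded 'A'/'a' and 'B'/'b' (an
    assumption made for this example); the four frequencies sum to 1.
    """
    p_ab = hap_freqs[("A", "B")]
    p_a = p_ab + hap_freqs[("A", "b")]   # frequency of allele A at SNP 1
    p_b = p_ab + hap_freqs[("a", "B")]   # frequency of allele B at SNP 2
    d = p_ab - p_a * p_b                 # zero when the alleles are independent
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical haplotype frequencies with strong allelic association.
example = {("A", "B"): 0.45, ("A", "b"): 0.05,
           ("a", "B"): 0.05, ("a", "b"): 0.45}
d, r2 = linkage_disequilibrium(example)
print(f"D = {d:.3f}, r2 = {r2:.3f}")
```

When D is zero the haplotype frequencies are simply products of allele frequencies, which is the assumption made when markers are treated as independent.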

In summary, the paper outlines the general utility of using large numbers of linked genetic markers to solve extended and complex relationships. Moreover, it motivates many future projects.

Paper II: Using Object Oriented Bayesian Networks to model linkage, linkage disequilibrium and mutations between STR markers

The second paper focuses on Object Oriented Bayesian Networks (OOBN) as a tool to model dependency between markers and alleles [61]. Bayesian networks embody the Markov property central to Markov chains: a node is conditionally independent of the rest of the network given the nodes it is connected to. The paper presents a graphical network created in the software Genie [104], a free tool to visualize networks and easily modify values, thereby obtaining the posterior distribution for all other nodes. As an example, the paper uses real data from two STR markers, adopted in regular forensic casework, where dependency between the markers (linkage) and association between alleles (linkage disequilibrium) had been suggested [57, 86]. Subsequent papers demonstrated that the latter could be ignored while the former should be accounted for in statistical calculations. The paper concludes that, although easy to present, the model suffers from some drawbacks. For example, the network experiences computational problems when calculating the exact posterior distribution given some observed nodes when a large number of alleles is present, a typical issue with polymorphic STR markers. Solutions to the mentioned difficulties are suggested, though a general implementation is not presented, and the final conclusion is that OOBNs may be used for research purposes.

Paper III: FamLink – A user friendly software for linkage calculations in family genetics

As an alternative to the framework presented in Paper II, Paper III adapts the functionality and algorithm of the software Merlin [75]. Building on the existing computational core, FamLink provides a graphical user interface aimed at forensic users interested in calculating likelihoods for, or simulating, linked genetic markers [58]. The paper considers some theoretical approaches to validation and simulation, demonstrating the utility of FamLink on a number of cases. Since its release, the software has been used by a number of laboratories.

Paper IV: A general model for likelihood computations of genetic marker data accounting for linkage, linkage disequilibrium and mutations

The fourth paper builds on the ideas presented particularly in Paper II by presenting a general model to handle dependency between genetic markers in likelihood computations [105]. Similar to the Lander-Green algorithm, this new model relies on Markov chains to handle dependency between markers, while also including a second, multistep Markov chain to handle dependency between alleles across markers. In addition, the model can handle data with genetic inconsistencies, i.e. mutations, and is therefore specifically suited for forensic purposes. A detailed implementation of the algorithm is described for X-chromosomal marker data. X-chromosomal markers have for some time been of great interest in the forensic community [88, 92-95] due to their ability to provide information in several cases where autosomal markers fail. For instance, two half siblings may ask whether they are maternal or paternal half siblings, something which is indistinguishable with autosomal markers.
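The half-sibling example can be made concrete for a single marker. The sketch below is a simplified illustration, not the model of Paper IV: it assumes two females, one marker, Hardy-Weinberg proportions, no mutation and no linkage, and the allele names and frequencies are hypothetical. Paternal half sisters always share their father's single X allele, whereas maternal half sisters share a maternal allele with probability 1/2. For an autosomal marker the sharing probability is 1/2 under both hypotheses, so the likelihood ratio is always 1.

```python
from itertools import product


def prob_pair_ibd0(g1, g2, freqs):
    """P(both genotypes) when the two persons share no allele by descent."""
    def hardy_weinberg(g):
        a, b = g
        return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return hardy_weinberg(g1) * hardy_weinberg(g2)


def prob_pair_ibd1(g1, g2, freqs):
    """P(both genotypes) when exactly one allele is shared by descent.

    Enumerates the shared allele s and the two unshared alleles x and y.
    """
    total = 0.0
    for s, x, y in product(freqs, repeat=3):
        if sorted((s, x)) == sorted(g1) and sorted((s, y)) == sorted(g2):
            total += freqs[s] * freqs[x] * freqs[y]
    return total


def lr_paternal_vs_maternal(g1, g2, freqs, x_chromosomal=True):
    """Likelihood ratio: paternal versus maternal half sisters, one marker."""
    l1 = prob_pair_ibd1(g1, g2, freqs)
    l0 = prob_pair_ibd0(g1, g2, freqs)
    # Probability of sharing one allele by descent under each hypothesis.
    share_paternal = 1.0 if x_chromosomal else 0.5
    share_maternal = 0.5
    numerator = share_paternal * l1 + (1 - share_paternal) * l0
    denominator = share_maternal * l1 + (1 - share_maternal) * l0
    return numerator / denominator


# Hypothetical allele frequencies and genotypes.
freqs = {"A": 0.2, "B": 0.3, "C": 0.5}
sister1, sister2 = ("A", "B"), ("A", "C")
print(lr_paternal_vs_maternal(sister1, sister2, freqs))         # X-chromosomal
print(lr_paternal_vs_maternal(sister1, sister2, freqs, False))  # autosomal
```

Under these assumptions the single-marker likelihood ratio can never exceed 2, and it drops to 0 when the sisters share no allele, which is why several markers, and a model that handles linkage and mutations, are needed in practice.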

The paper goes on to demonstrate the utility of the implementation using simulated data as well as some real examples. In summary, the software provides means to solve cases where no previous methods or implementations appear adequate.

Paper V: Familias 3 – Extensions and new functionality

Familias is software for calculating likelihoods for genetic marker data given some hypotheses about relatedness for a set of persons [79]. The software has long been considered a gold standard in the forensic community, but has lacked some desired functionality [81]. This paper focuses on the new version, developed with user requests in mind while keeping the computational core. The new version includes the possibility to handle disaster victim identification (DVI) operations and missing person databases, where large numbers of unidentified remains are compared against large numbers of reference families. In addition, users may now use Monte Carlo simulations to find distributions of likelihood ratios for any given case. This is particularly interesting in casework, as laboratories may now find out whether planned case data are likely to yield sufficiently strong results, i.e. a high likelihood ratio, given the number of genotyped persons. Furthermore, the paper presents a new mutation model dealing with the increasing number of microvariant alleles. In summary, the new version presented in the paper provides several new features while still preserving old functionality. The new version of Familias, freely available at www.familias.no, has been developed and coded by the author of this thesis. On a mathematical note, observant readers may note that the mutation parameterization presented in Section 1.1.3 for the extended stepwise model differs from that presented in Paper VI. A small change has been made, presenting a slightly updated notation herein, to obtain a more consistent model.
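The idea behind the Monte Carlo simulations mentioned in the Paper V summary can be sketched as follows. This is an illustrative toy example and not the Familias implementation: it assumes a parent-child duo, independent markers that all share the same hypothetical allele frequencies, Hardy-Weinberg proportions and no mutations. Genotypes are simulated under the hypothesis that the two persons are parent and child, and the likelihood ratio against unrelatedness is computed for each simulated case.

```python
import random
from itertools import product

# Hypothetical allele frequencies, reused for every marker for simplicity.
FREQS = {"10": 0.2, "11": 0.3, "12": 0.5}


def hardy_weinberg(g, freqs):
    """Genotype probability for an unrelated person."""
    a, b = g
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def prob_parent_child(parent, child, freqs):
    """Joint genotype probability when one allele is shared by descent."""
    total = 0.0
    for s, x, y in product(freqs, repeat=3):
        if sorted((s, x)) == sorted(parent) and sorted((s, y)) == sorted(child):
            total += freqs[s] * freqs[x] * freqs[y]
    return total


def simulate_lrs(n_sims, n_markers, freqs, seed=1):
    """Simulate parent-child duos and return one likelihood ratio per case."""
    rng = random.Random(seed)
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]

    def draw():
        return rng.choices(alleles, weights=weights)[0]

    lrs = []
    for _ in range(n_sims):
        lr = 1.0
        for _ in range(n_markers):
            parent = (draw(), draw())
            # The child receives one random parental allele and one
            # allele drawn from the population.
            child = (rng.choice(parent), draw())
            lr *= (prob_parent_child(parent, child, freqs)
                   / (hardy_weinberg(parent, freqs) * hardy_weinberg(child, freqs)))
        lrs.append(lr)
    return lrs


def fraction_above(lrs, threshold):
    """Share of simulated cases whose likelihood ratio exceeds a threshold."""
    return sum(lr > threshold for lr in lrs) / len(lrs)


lrs = simulate_lrs(n_sims=1000, n_markers=15, freqs=FREQS)
print(fraction_above(lrs, 100.0))
```

Repeating the simulation with a different number of markers or genotyped persons gives a rough indication of whether a planned case is likely to reach a chosen likelihood ratio threshold.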

Paper VI: FamLinkX – Implementation of a general model for likelihood computations for X-chromosomal marker data

The sixth paper presents a validation of the software FamLinkX. The program implements the model outlined in Paper IV for X-chromosomal marker data. The paper provides ideas to validate and confirm results when using the software to calculate likelihoods. This includes some theoretical considerations as well as simulations and a discussion on the choice of parameters. Validation is in general not as straightforward as for other similar programs implementing exact computations, e.g. Familias and Merlin [35, 75, 81]. Although the calculations in FamLinkX are exact, several parameter choices can influence the results considerably. The simulations provide an idea of the general power of X-chromosomal markers in some common cases in forensic genetics.
