**Materials and methods**

**3. Materials and methods 1. Samples**

**3.5. Statistical analyses**

**3.5.1. Intra-population variability**

**3.5.1.1. Allele frequencies**

Allele frequency is the relative frequency of a particular allele at a given locus in a given population. In population genetics, allele frequencies are used to describe the amount of variation at a particular locus in a population.

For Indel, Alu, and STR markers, allele frequencies were calculated using Arlequin v.3.5 software (Excoffier and Lischer, 2010). In X-chromosome markers, allele frequencies of males and females were calculated separately, and then total frequencies were estimated using the following formula for each allele in each marker:

$$p_i = \frac{(2 \times \text{female frequency}) + \text{male frequency}}{3}$$
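As a minimal sketch of the weighting in this formula (two X-chromosome copies per female, one per male), the pooled frequency can be computed as follows. The function name and the example frequencies are hypothetical and not taken from the study data.

```python
def pooled_x_allele_frequency(female_freq: float, male_freq: float) -> float:
    """Pool sex-specific allele frequencies for an X-chromosome marker.

    Female frequencies are weighted by 2 and male frequencies by 1,
    following the formula given in the text.
    """
    return (2.0 * female_freq + male_freq) / 3.0


# Hypothetical example: allele at 0.30 in females and 0.24 in males.
pooled = pooled_x_allele_frequency(0.30, 0.24)
print(round(pooled, 4))
```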

**3.5.1.2. Hardy-Weinberg Equilibrium (HWE)**

Once the allele frequencies of a population are known, the Hardy-Weinberg principle predicts the genotype proportions in the succeeding generation under random union of gametes.

In our studies, HWE tests and their p-values were computed using Arlequin v.3.5 software (Excoffier and Lischer, 2010). For X-chromosome markers, only the female data were taken into account.

In statistical significance testing, the p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true (here, that the population is in HWE). The null hypothesis is rejected when the p-value is less than the significance level α, which in our case is 0.05; the result is then said to be statistically significant.
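The Hardy-Weinberg expectation itself can be illustrated with a short sketch: for allele frequencies p and q, homozygotes are expected at p² and heterozygotes at 2pq. This only derives the expected genotype proportions; it does not reproduce the test performed by Arlequin. The function name and allele labels are hypothetical.

```python
from itertools import combinations_with_replacement


def hwe_expected_genotypes(allele_freqs):
    """Expected genotype proportions under Hardy-Weinberg equilibrium.

    allele_freqs maps allele label -> frequency (frequencies sum to 1).
    Returns a dict mapping an (allele, allele) tuple -> expected proportion.
    """
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[a], allele_freqs[b]
        # Homozygotes: p^2; heterozygotes: 2pq.
        expected[(a, b)] = p * p if a == b else 2.0 * p * q
    return expected


# Hypothetical biallelic Indel with alleles "D" (deletion) and "I" (insertion).
for genotype, proportion in hwe_expected_genotypes({"D": 0.6, "I": 0.4}).items():
    print(genotype, round(proportion, 4))
```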

**3.5.1.3. Diversity parameters**

**3.5.1.3.1. Gene diversity (GD)**

This is equivalent to the expected heterozygosity for diploid data (X-chromosome and autosomal markers). It is defined as the probability that two randomly chosen haplotypes are different in the population (Nei, 1987).

$$\text{Gene diversity} = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the frequency of the i-th allele in the sample and n is the number of alleles.

Gene diversity was calculated using only female data for X-chromosome markers and all the samples for autosomal Indels.

For multiallelic markers, diversity parameters were estimated using Arlequin v.3.5 (Excoffier and Lischer, 2010).
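A minimal sketch of the gene diversity formula as written above, assuming the allele frequencies have already been estimated. The function name and example frequencies are hypothetical, and no sample-size correction is applied here.

```python
def gene_diversity(allele_freqs):
    """Gene diversity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical biallelic marker with allele frequencies 0.7 and 0.3.
print(round(gene_diversity([0.7, 0.3]), 4))
```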

**3.5.1.3.2. Haplotype diversity (HD)**

Haplotype diversity is a measure of the uniqueness of a particular haplotype in a given population. It is defined as the probability that two randomly chosen haplotypes are different in the population. This parameter is equivalent to gene diversity in haploid markers. It was calculated using Arlequin v.3.5 (Excoffier and Lischer, 2010).

Haplotype diversity is computed following Nei and Tajima (1981):

$$HD = \frac{N}{N-1}\left(1 - \sum_{i=1}^{n} x_i^2\right)$$

Here, $x_i$ is the (relative) frequency of the i-th haplotype in the sample, n is the number of different haplotypes, and N is the sample size.

This was calculated for male samples in X-chromosome markers. It was also calculated in the X-STRs for the four different linkage groups (LGs) included in the Investigator Argus X-12 kit, and for X-Indels for the markers that were shown to be linked in the linkage disequilibrium analysis. It was also performed for all the samples studied for Y-chromosome STRs, and mtDNA.
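The haplotype diversity formula can be sketched directly from a list of observed haplotypes. This is an illustrative implementation, not the Arlequin code; the function name and the haplotype labels are hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Haplotype diversity: N / (N - 1) * (1 - sum of squared frequencies).

    haplotypes is a sequence of hashable haplotype labels, one per sample.
    """
    n_samples = len(haplotypes)
    if n_samples < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n_samples) ** 2 for c in counts.values())
    return n_samples / (n_samples - 1) * (1.0 - sum_sq)


# Hypothetical sample of five haplotypes, three of them distinct.
sample = ["H1", "H1", "H2", "H3", "H3"]
print(round(haplotype_diversity(sample), 4))
```

When every sample carries a different haplotype the value is 1, and when all samples share one haplotype it is 0.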

**3.5.1.3.3. Mitochondrial DNA diversity**

In mtDNA analyses, additional parameters were calculated: a) K (number of different haplotypes); b) S (number of polymorphic sites); c) π (nucleotide diversity), the probability that two randomly chosen homologous (nucleotide or RFLP) sites are different, equivalent to gene diversity at the nucleotide level for DNA data; and d) Theta (θ).

Theta is a fundamental parameter of molecular evolution that encapsulates the expected level of genetic diversity in a randomly mating, constant-sized population not subject to selection when an equilibrium is reached between genetic drift and mutation. It is defined as:

$$\theta = 2nN_e\mu,$$

where n is the number of heritable copies of the locus per individual (0.5 in the case of mtDNA), $N_e$ is the effective population size, and $\mu$ is the mutation rate per nucleotide (or per sequence) and per generation (Nei, 1987; Tajima, 1993).

There are different ways to estimate θ from sequence data; depending on the parameter used the estimator is called θs (using the number of polymorphic sites), θK (using the number of different haplotypes), etc. The Theta estimator based on the number of different lineages (θK) (Tamura and Nei, 1993), which is based on the relationship between sample size and the number of distinct lineages, is more sensitive to the effects of lineage sorting during recent demographic history.

**3.5.1.4. Neutrality tests**

There are many methods to detect selection, which typically calculate a statistic that compares a feature of the observed diversity to that expected under neutral evolution.

In this work, Tajima's D test of selective neutrality (Tajima, 1989; 1993), which compares the number of segregating sites per site with nucleotide diversity, was computed in the mtDNA analysis. This test compares two estimators of the population parameter θ. Under the infinite-site model, both estimators should estimate the same quantity, but differences can arise under selection, population non-stationarity, or heterogeneity of mutation rates among sites. The test statistic D is defined as:

$$D = \frac{\hat{\theta}_{\pi} - \hat{\theta}_{s}}{\sqrt{\mathrm{Var}(\hat{\theta}_{\pi} - \hat{\theta}_{s})}}$$
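The statistic can be illustrated with a short sketch for aligned sequences of equal length. It uses the mean number of pairwise differences as the π-based estimator, S/a1 as the estimator based on segregating sites, and the standard variance constants from Tajima (1989). This is an illustration under those assumptions, not the Arlequin implementation; the function name and toy sequences are hypothetical.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for a list of aligned, equal-length sequences."""
    n = len(sequences)
    if n < 4:
        raise ValueError("at least four sequences are needed for a non-zero variance")
    if any(len(s) != len(sequences[0]) for s in sequences):
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("D is undefined without segregating sites")

    # theta_pi: mean number of pairwise differences.
    pairs = list(combinations(sequences, 2))
    theta_pi = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    ) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    theta_s = seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (theta_pi - theta_s) / sqrt(variance)


# Toy alignment in which every variant is a singleton (excess of rare variants).
print(tajimas_d(["AAAA", "AAAT", "AATA", "ATAA"]) < 0)
```

An excess of rare variants, as in the toy alignment, yields a negative D, whereas an excess of intermediate-frequency variants yields a positive D.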

**3.5.1.5. Linkage disequilibrium (LD)**

Linkage disequilibrium is an estimate of recombination at a population level. LD measures whether specific alleles at different loci are correlated with one another more or less often than would be expected by chance (Jobling et al., 2014). LD is influenced by many factors, including evolutionary forces and also population characteristics (mating system, population substructure, etc.). Thus, the pattern of LD is a powerful tool to understand past evolutionary and demographic events in human history or in the history of a particular population.

An exact test of LD was performed for haplotypic data (Y-chromosome markers, and X-chromosome markers in males) using Arlequin v.3.5 software (Excoffier and Lischer, 2010), with 1,000,000 steps in the Markov chain and 1,000 dememorization steps.

**3.5.1.6. Forensic parameters**

To test how efficient the recombinant markers used in this work (STR, Alu and Indel) are for forensic purposes, a series of parameters was calculated.

Polymorphism information content (PIC) (Botstein et al., 1980) and expected heterozygosity (formula equivalent to gene diversity) are devised for more general purposes and are valid for both autosomal and X-chromosome markers.

$$PIC = 1 - \left(\sum_{i=1}^{n} p_i^2\right) - \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} 2p_i^2p_j^2$$

The MECKRÜ parameter is not suitable for X-chromosome markers except for deficiency cases in which the paternal grandmother is investigated instead of the alleged father.

$$MEC_{\text{KRÜ}} = \sum_{i}\left[f_i^3(1 - f_i)^2 + f_i(1 - f_i)^3\right] + \sum_{i<j} f_i f_j (f_i + f_j)(1 - f_i - f_j)^2$$

Kishida et al. (1997) devised a MECKIS for X-chromosome markers that covers trios including a daughter. If MECKRÜ is compared to MECKIS, the latter is considerably larger. This highlights the fact that in trios involving a daughter, X-chromosome markers are more efficient than autosomal markers.

$$MEC_{\text{KIS}} = \sum_{i}\left[f_i^3(1 - f_i) + f_i(1 - f_i)^2\right] + \sum_{i<j} f_i f_j (f_i + f_j)(1 - f_i - f_j)$$

$$MECD_{trio} = 1 - \sum_{i} f_i^2 + \sum_{i} f_i^4 - \left(\sum_{i} f_i^2\right)^2$$

$$MECD_{duo} = 1 - 2\sum_{i} f_i^2 + \sum_{i} f_i^3$$

Finally, Desmarais et al. (1998) introduced formulae for the mean exclusion chance of X-chromosome markers in trios involving daughters (MECDtrio) and in father/daughter duos lacking maternal genotype information (MECDduo). MECDtrio is equivalent to MECKIS whilst MECDduo is also appropriate for maternity testing of mother/son duos.

Power of discrimination in females (PDfemale) and power of discrimination in males (PDmale) are parameters suitable to assess the power of markers for forensic identification purposes in females and males, respectively.

$$PD_{female} = 1 - 2\left(\sum_{i} f_i^2\right)^2 + \sum_{i} f_i^4$$

$$PD_{male} = 1 - \sum_{i} f_i^2$$

Here, $f_i$ ($f_j$) is the population frequency of the i-th (j-th) marker allele.
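The frequency-based parameters of this subsection can be sketched as plain functions of a list of allele frequencies. The PIC function uses the standard form of Botstein et al. (1980); all function names and the example frequencies are hypothetical, and the sketch is an illustration rather than the software used in this work.

```python
from itertools import combinations


def pic(f):
    """Polymorphism information content (Botstein et al., 1980)."""
    s2 = sum(p ** 2 for p in f)
    cross = sum(2 * p ** 2 * q ** 2 for p, q in combinations(f, 2))
    return 1 - s2 - cross


def mec_kruger(f):
    """Mean exclusion chance for autosomal markers in trios (MECKRÜ)."""
    single = sum(p ** 3 * (1 - p) ** 2 + p * (1 - p) ** 3 for p in f)
    pairs = sum(p * q * (p + q) * (1 - p - q) ** 2 for p, q in combinations(f, 2))
    return single + pairs


def mec_kishida(f):
    """Mean exclusion chance for X markers in trios with a daughter (MECKIS)."""
    single = sum(p ** 3 * (1 - p) + p * (1 - p) ** 2 for p in f)
    pairs = sum(p * q * (p + q) * (1 - p - q) for p, q in combinations(f, 2))
    return single + pairs


def mecd_trio(f):
    """Desmarais et al. (1998) mean exclusion chance, trios with a daughter."""
    s2 = sum(p ** 2 for p in f)
    s4 = sum(p ** 4 for p in f)
    return 1 - s2 + s4 - s2 ** 2


def mecd_duo(f):
    """Desmarais et al. (1998) mean exclusion chance, father/daughter duos."""
    return 1 - 2 * sum(p ** 2 for p in f) + sum(p ** 3 for p in f)


def pd_female(f):
    """Power of discrimination in females."""
    s2 = sum(p ** 2 for p in f)
    return 1 - 2 * s2 ** 2 + sum(p ** 4 for p in f)


def pd_male(f):
    """Power of discrimination in males."""
    return 1 - sum(p ** 2 for p in f)


# Hypothetical three-allele marker.
freqs = [0.5, 0.3, 0.2]
print(round(mec_kruger(freqs), 4), round(mec_kishida(freqs), 4))
print(round(mecd_trio(freqs), 4), round(mecd_duo(freqs), 4))
print(round(pd_female(freqs), 4), round(pd_male(freqs), 4))
```

For any set of frequencies summing to 1, `mec_kishida` and `mecd_trio` return the same value, which is a convenient numerical check of the stated equivalence between MECKIS and MECDtrio.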

For autosomal Indels, PowerStats formulae (Brenner and Morris, 1989; Jones, 1972) were used to calculate forensic parameters. For the match probability (MP) and power of discrimination, the formulae are the following:

$$\text{Match probability} = \sum_{i=1}^{n} G_i^2$$

$$\text{Power of discrimination} = 1 - \sum_{i=1}^{n} G_i^2$$

where $G_i$ is the fraction of samples with genotype i.
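A minimal sketch of these two formulae, computed from a list of observed genotypes. Genotypes are treated as unordered allele pairs; the function names and the example genotypes are hypothetical and do not reproduce the PowerStats spreadsheet itself.

```python
from collections import Counter


def match_probability(genotypes):
    """Sum of squared observed genotype fractions."""
    n_samples = len(genotypes)
    if n_samples == 0:
        raise ValueError("no genotypes supplied")
    # Sort each pair so that ("I", "D") and ("D", "I") count as one genotype.
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    return sum((c / n_samples) ** 2 for c in counts.values())


def power_of_discrimination(genotypes):
    """One minus the match probability."""
    return 1.0 - match_probability(genotypes)


# Hypothetical Indel genotypes for four individuals.
observed = [("I", "D"), ("D", "I"), ("I", "I"), ("D", "D")]
print(match_probability(observed), power_of_discrimination(observed))
```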

To determine the power of exclusion (PE) and typical paternity index (TPI or PI), the following formulae were used:

$$\text{Power of exclusion} = h^2(1 - 2hH^2)$$

$$\text{Typical paternity index} = \frac{H + h}{2H}$$

where h is the proportion of heterozygotes and H is the proportion of homozygotes. These formulae are valid for both autosomal and X-chromosome markers.
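A short sketch of both formulae, starting from counts of heterozygous and homozygous individuals. It assumes that h and H enter the formulae as proportions of the sample, so that PE stays between 0 and 1; the function names and example counts are hypothetical.

```python
def power_of_exclusion(n_het, n_hom):
    """PE = h^2 * (1 - 2 * h * H^2), with h and H taken as proportions."""
    total = n_het + n_hom
    if total == 0:
        raise ValueError("no individuals supplied")
    h = n_het / total
    big_h = n_hom / total
    return h ** 2 * (1 - 2 * h * big_h ** 2)


def typical_paternity_index(n_het, n_hom):
    """TPI = (H + h) / (2 * H); undefined when no homozygotes are observed."""
    total = n_het + n_hom
    if n_hom == 0:
        raise ValueError("TPI is undefined when there are no homozygotes")
    h = n_het / total
    big_h = n_hom / total
    return (big_h + h) / (2 * big_h)


# Hypothetical sample: 50 heterozygotes and 50 homozygotes.
print(power_of_exclusion(50, 50), typical_paternity_index(50, 50))
```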

To test forensic efficiency in Y-STRs, discrimination capacity (DC) was calculated as the percentage of different haplotypes and haplotype match probability (HMP) as 1-haplotype diversity.
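Both Y-STR parameters follow directly from the list of observed haplotypes, as in the sketch below. Haplotype diversity is computed with the N/(N − 1) formula given earlier in this section; the function names and haplotype labels are hypothetical.

```python
from collections import Counter


def discrimination_capacity(haplotypes):
    """Percentage of different haplotypes among all sampled haplotypes."""
    if not haplotypes:
        raise ValueError("no haplotypes supplied")
    return 100.0 * len(set(haplotypes)) / len(haplotypes)


def haplotype_match_probability(haplotypes):
    """HMP = 1 - haplotype diversity."""
    n_samples = len(haplotypes)
    if n_samples < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n_samples) ** 2 for c in counts.values())
    diversity = n_samples / (n_samples - 1) * (1.0 - sum_sq)
    return 1.0 - diversity


# Hypothetical Y-STR haplotypes for four males, three of them distinct.
y_haplotypes = ["h1", "h1", "h2", "h3"]
print(discrimination_capacity(y_haplotypes))
print(round(haplotype_match_probability(y_haplotypes), 4))
```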

**3.5.2. Genetic structure and inter-population variability**