Applications of k-mer count tables in community ecology

(1)

(2)

(3)

Applications of k -mer count tables in community ecology

An example of microbial succession upon glacial retreat

Knut Hauge Engvik

(4)

c

2021 Knut Hauge Engvik

Applications of k-mer count tables in community ecology http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(5)

Aknowledgements

I would like to thank my supervisor, Professor Alexander Eiler, for sharing his insights and time so freely in guiding me through this project.

I would like to thank my co-supervisor, Professor Thorbjørn Rognes, for his contribu- tions and comments.

I would like to thank the Department of Biosciences in general, and the AQUA group in particular, for providing an open and welcoming academic environment.

I would also like to acknowledge the contribution of the Australian Microbiome con- sortium in the generation of data used in this publication. The Australian Microbiome initiative is supported by funding from Bioplatforms Australia and the Integrated Marine Observing System (IMOS) through the Australian Government’s National Collaborative Research Infrastructure Strategy (NCRIS), Parks Australia through the Bush Blitz pro- gram funded by the Australian Government and BHP, and the CSIRO.

In addition, I would like to acknowledge the contribution of the Marine Microbes consor- tium in the generation of data used in this publication. The Marine Microbes project was supported by funding from Bioplatforms Australia and the Integrated Marine Observ- ing System (IMOS) through the Australian Government National Collaborative Research Infrastructure Strategy (NCRIS) in partnership with the Australian research community.

(7)

1 Preface

The study of community composition is at the heart of ecology. How community composition changes from site to site, or how and to what degree it may change over time in a given location, are thus of interest within the field. What factors drive the differences in composition? Are they the result of some stochastic process or are they determined by underlying factors? When trying to answer such question one is immediately faced with two problems. The first of these is to define the fundamental elements of the community. How does one define a community and its structure? After answering this, a second, practical challenge arises. How can one obtain a representative sample of the entire community?

These two questions are difficult for ecologists in general. Within microbial ecology however, these questions pose some particular challenges. Defining the structure of a community, and determining what should be counted as its constituent components, is not a trivial matter. In the macroscopic world, community structures are usually defined around the classical taxonomy, with the various species as the fundamental units of the community. Defining community structures based on species is less satisfactory for microbial communities, as such delineations are spurious within the microscopic world. Instead, community structures are often built around phylotypes and operational taxonomic units.

And while this offers a practical solution for defining communities, the imposition of dis- crete categories on what is essentially a continuum, introduces a degree of arbitrariness.

Horizontal gene transfer does not just muddle the species concept, but to some extent it also divorces diversity from phylogeny, as communities can diverge in important ways without any change in phylogenetic composition.

The more practical problems regarding community sampling also differ for the macroscopic and microscopic ecologists. The gathering of representative samples is arguably easier for the microbiologist, as modern sequencing technology allows researchers to cast a wide net when sampling microbiomes. The challenge lies rather in identifying the catch.

When attempting to analyse the composition of microbial communities, there are, broadly speaking, two ways to go about this, and they can ofcourse be applied in conjunction.

One can attempt to map the sequence reads to databases of known sequences. This has the advantage of potentially yielding very rich information when reads are mapped to well studied, well annotated sequences. Unfortunately this is rare in microbial ecology. Unless the sample is taken from the human gut, chances are the sample will contain a large proportion of unknown species. This leads to a large part of the reads being discarded, as they cannot be mapped to any known sequences. Hopefully, as our knowledge of the microbial world increases, the proportion of unknown species will diminish and a larger part of the sequencing reads can be utilized in downstream analysis. But, as of now, relying purely on this approach is inadequate.

(8)

Rather than relying on existing databases one can attempt to createde novo assemblies or attempt to group raw sequences directly. In a sense, both of these approaches seek to construct an ad hoc taxonomy from the sequences themselves, a taxonomy that can then be used to compare communities. While this to a degree addresses the problems that arise due to the preponderance of unknown microbial organisms, it is not without its own challenges. The assembly step is particularly demanding computationally and some of the data material is still discarded when sequences fail to form contigs or cannot be binned or clustered. In addition, such operational taxonomies only apply to the data set from which they are derived and are not immediately comparable across studies.

The purpose of this thesis is to explore the viability of treating k-mers as the fundamental elements of microbial communities. That is, compare communities by assessing the β-diversity based directly on k-mer counts. An attempt to measure k-mer turnover, so to speak. The main selling point of this approach is that it would be completely un- affected by the degree to which the community under examination consists of known or unknown species, thus circumventing one of the challenges facing traditional approaches.

It also circumvents the need for a taxonomy in order to assess differences between communities, and the entire data set can be used in the comparison of communities. There is another major benefit to this approach: It is cheap. Both in labor cost and the amount of computational hours needed.

Looking at diversity through the lens of k-mer counts is not a new idea. In 2014, a comprehensive study on the performance of various measures of k-mer diversity in clustering and gradient analysis of metagenomes was published (Wang, Liu, Chen, Chen,

& Sun, 2014). A later study by Dubinkina et al. showed that diversity measures based on k-mer abundance data compared favorably with traditional measures based on species abundance (Dubinkina, Ischenko, Ulyantsev, Tyakht, & Alexeev, 2016). Favorable in the sense of yielding comparable dissimilarity values. Later that same year SIMKA was released. This software allows one to calculateβ-diversity based on k-mer counts directly from a set of reads. In the accompanying article the developers report close correlation between β-diversity based on k-mers and those based on phylodistance (Benoit et al., 2016).

It would seem that a natural next step would be to attempt a classic ecological study based on the concept ofk-mer diversity. To that end I will attempt to discern what factors have the largest impact onk-mer turnover in a set of sampled microbial communities. The main focus of the thesis is on a set of communities sampled from several lakes on Svalbard, Norway. These samples constitute four chronosequences and were gathered in order to study microbial succession in the wake of retreating glaciers. One part of this thesis will attempt to discern if there is a pattern of succession among these sites. Additionally, an exploratory analysis of the data will be presented along with a comparison of any findings with current theories on microbial community assembly and the results obtained from a

(9)

more traditional analysis.

A second group of samples was included to provide a test case that is easier to evaluate.

This data set was downloaded from Bioplatforms Australia and comprises samples from diverse environments that are likely to have very different community compositions. For k-mer basedβ-diversity to be of use in ecological analysis, it should be able to tell apart samples from sufficiently different environments.

In summary, the goal of this thesis is fourfold:

• To introduce the concept of measuring diversity via k-mer counts, and to highlight the advantages and limitations of such an approach.

• To do a “field test” by performing an ecological analysis on two sets of metagenomic samples.

• To compare any findings with results obtained from the same data set, but based on sequence alignment to known taxa and orthology groups.

• To evaluate the performance of ak-mer based approach to microbial ecology.

2 Introduction

2.1 On k-mers

The basic unit of analysis in this thesis is the k-mer. K-mers are known by various other names. In natural language processing they are often called n-grams. In biology they can also be encountered under the name l-mers or, when their frequencies are considered, as feature frequencies. Throughout this thesis I will keep to what seems to be the most commonly used name in biology, k-mers. Ak-mer can be defined as follows:

Definition 1 A k-mer over an alphabet A, is any sequence {αⁱ} of length k, such that for all i, αⁱ ∈ A.

By alphabet is meant any finite, non empty set of symbols. In natural language processing, for instance, a typical alphabet would be just that: a typical alphabet. In biology the most common alphabet to consider would be the nucleotides that make up DNA, that is:

A ={A, T, C, G}

Other alphabets in biology would include the set of nucleotides found in RNA, or the amino acids that make up proteins. Throughout this thesisk-mers will refer to sequences over the alphabet of DNA nucleotides. Consider the nucleotide sequences ATTG, CCCA and GTTGACA. The two first would be examples of 4-mers and the latter is a 7-mer.

(10)

The most rudimentary analysis is to count the occurrence of each k-mer in a string or a set of strings. Counting the 2-mers of the two 4-mers and the 7-mer would yield the results in table 1. These counts can be viewed as vectors in the space of all possible 2-mers where each unique k-mer represent one dimension. The k-mer counts for a set of sequences, from now on referred to as a count vector, can be normalized by dividing the count vector by its length to yield a vector of unit length. These unit length vectors will be referred to as frequency vectors. In general the k-mer space over an alphabet A has cardinality:

d=|A|^k.

Thus, with the genetic alphabet, the space of 2-mers is of dimension:

d=|A|^k = 4² = 16.

One peculiarity about counting k-mers in DNA sequences, is that they come in pairs.

When working with raw sequence data, there is no way to tell what strand any particular read stems from. This means that ak-mer is effectively indistinguishable from its reverse complement and there is no reason to track them as separatek-mers. Thus we convert all k-mers to their canonical form prior to counting (Table 2). The canonical representation of a pair of complementary sequences is the first sequence when put in lexicographic order.

When k is even some k-mers will be palindromes. That is, they are their own reverse complement. For this reasonk-mers are usually counted using odd values ofk. The presence of palindromes can cause difficulties for alignment algorithms, but is not important to the comparison of k-mer frequencies per se. It does however lead to an inflation of the space of canonical k-mers. For a k-mer to be a palindrome it must be the concatenation of two l-mers that are reverse complement of each other, with l =k/2. Thus, the space of canonical k-mers for odd k is of dimension:

d= |A|^k 2 while for even k:

d= |A|^k

2 + |A|^k² 2 .

The latter term being due to palindromes not forming canonical pairs.

This means that the number of possible k-mers rapidly become unwieldy. If one were particularly stingy, one could store the presence/absence information for k-mers by an ordered byte string. For k = 5 this would take 64 bytes. At k = 9 it would take roughly 32kbyte. At 15 it would be 62Mbytes and at k = 21 it would be 225Gbyte. If one wanted to also store actual counts and the k-mers themselves, things become even worse. Of course, when k becomes large, most k-mers will not be present in a given set of reads.

Software such as Jellyfish (Mar¸cais & Kingsford, 2011) exploit this fact and employ clever

(11)

Table 1: Non canonical k-mer counts of the sequences GTTGACA, CCCA and ATTG.

2-mers AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT

ATTG 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1

CCCA 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0

GTTGACA 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1

Table 2: Canonicalk-mer counts of the sequences GTTGACA, CCCA and ATTG.

2-mers AA AC AG AT CA CC CG GA GC TA

ATTG 1 0 0 1 1 0 0 0 0 0

CCCA 0 0 0 0 1 2 0 0 0 0

GTTGACA 1 2 0 0 2 0 0 1 0 0

algorithms using probabilistic hash functions to createk-mer tables as needed, drastically reducing the memory overhead, at the cost of minor uncertainty in the counts. Still, the computational and memory overhead fork >13 has led to the decision to focus onk-mers of length 13 or less.

The rationale behind k-mer analysis is that strings with similar counts, or frequencies, share other similarities as well. Suppose one counted all the 5-mers of this paper, using the Latin alphabet. And, after doing so, all the pages were thoroughly shuffled and a subsequent count was performed. A proper comparison of the 5-mer frequencies would reveal similarity between the original and the shuffled pages. In a sense it measures the shared information in the two texts.

Different length k-mers do not necessarily pick up on the same types of information.

The two sentences “Brutus killed Caesar” and “Caesar killed Brutus” have very different meaning, but share a lot of shorterk-mers. Longerk-mers might better pick up the rather important difference of whom killed who. This might differ in an inflected language such as Latin, where sentence structure is less important. The genetic language has several structural layers. The short phrases that constitutes codons certainly has meaning on their own, but the order of these phrases imparts another level of meaning.

Longer structures, such as genes, can move around and retain their meaning, yet gain new context. While the study of k-mer frequencies is not directly concerned with the meaning of particular phrases, the complexity of the language involved indicates that the choice of k-mer length is neither irrelevant nor obvious.

Clearly, longer k-mers contain more information. The sequences from which we get thek-mer counts, are themselvesk-mers of a sort, and any information contained in thek- mer count must be contained in the sequence from whence the counts were obtained. One might then conclude that, ignoring the computational cost, longerk-mers are better. This is not necessarily so. The different lengthk-mers might also highlight different parts of the data. Consider for a moment very longk-mers. On the scale of a genome. Granted, there is no practical possibility of acquiring reads of sufficient length for suchk-mers to appear.

(12)

But it seems clear that such k-mers would be capable of telling apart even individual (non clonal) organisms. So for very large k, the k-mers are highly specific. And for very small k they are very general. It seems possible then, that the different k-mer sizes might have different sensitivities to community structures at different taxonomic ranks.

And while scientific pursuits are naturally focused on the acquisition of information, an equally important part of the process is the discarding of irrelevant information. As such, longer k-mers might be confounding, rather than clarifying.

2.2 Diversity of k-mers

The concept of diversity is central to all spheres of ecology. The concept is however rather elusive. While most would probably have an intuitive understanding of what constitutes diversity, formalizing it remains a challenge. For some collection of elements, any quantity describing its diversity should, ideally, capture the variety of the elements, the heterogeneity of the collection and the multiplicity of each element. Perhaps the most basic measure of diversity would be the number of different elements in the collection.

In biology this is referred to as species richness. More complex indices take into account how evenly distributed the species are or the total abundance of the community. Multiple indices exist, among them Fisher’s alpha (Fisher, Corbet, & Williams, 1943) and the Shannon index (Shannon, 1948) to name a couple.

While these diversity indices to some extent capture the idea of diversity, they fail to incorporate a sense of how the species are distributed across space. Are the various species distributed evenly across the area or do the species composition vary throughout? The more holistic concepts of α,β and γ diversity introduced by Whittaker try to encompass this (Whittaker, 1960). The total diversity of an area, the γ diversity, is then a product both of the local diversity, or α diversity, in sub regions, and the changes in community composition between these sub regions, β diversity. Now one could ask if the α diversity should also be further divided into sub-sub regionαdiversity and theβ diversity between these sub-sub region and so ad infinitum.

Purportedly, the term β diversity is today used in a variety of ways that differ from its original intended purpose. The original intent being a way of quantifying heterogeneity among sub regions. Indeed, some authors have laudably attempted to clear up the confusion (Tuomisto, 2010a), (Tuomisto, 2010b),(Anderson et al., 2011). As far as I can tell, there are two main types of β diversity that are conceptually different. The first type consists of those concept that try to capture Whittaker’s idea of heterogeneity, or what has been termed “true β diversity” (Tuomisto, 2010a). For the moment let us call them β₁. The second type are those measures that are comparative in nature, let us call those β₂. Throughout this thesis I will use the term β diversity in the latter sense, as a measure of difference in composition between communities. In this sense, it is not actually

(13)

a diversity measure, as it is not intrinsic to any given community, but rather depends on what other community one compares it with. While these two types of βdiversity are not the same per se, they are intimately related, and I believe that the differences between the two are overdramatized. The difference between the two concepts is very much the same as the difference between α and γ diversity, where γ diversity is an aggregate of α diversities. If one is examining a single site, then α and γ diversities are the same.

It seems to me that much of the confusion arises from the fact that β diversity has no meaning for singular units. It does not come into play until there are at least two groups being studied. Thus, the fundamental unit for examination of β diversity are pairs, not single communities or groups. And, analogous to the situation forα andγ diversity when a single group is considered, for a single pair of groups, the two concepts of β diversity describe the same kind of information. While they might not agree numerically for a given index, they do agree in what phenomenon they try to estimate. β₁ diversity is an aggregate of β₂ diversity. Thus, the more fundamental concepts are arguably those of α diversity and β₂ diversity.

Some authors seem to agree thatαandγdiversities are essentially the same, what they call inventory diversity, but maintain that there is an important distinction between β₁ (proportional) andβ₂ (differential) diversities (Jurasinski, Retzer, & Beierkuhnlein, 2009).

In my view, the proportional β₁ diversity looks like a dissimilarity measure posing as an inventory index. And, as popular and ingrained as Whittaker’s concept of β diversity is, it is not obvious that it has any inherent value as an inventory index. As previously noted, when the number of sub regions is reduced to one, the concept is meaningless. And while heterogeneity probably should be a part of any “complete” concept of diversity, inventory indices are local while heterogeneity is intrinsically non local. The only way β diversity can be perceived as an inventory index, is as an inventory index of communities. In a sense it measures how many meaningful subgroups there are. A measure of richness of communities, if you will.

Of course there is also the problem of deciding what exactly to measure the diversity of. In a collection consisting of a Labrador, a tiger, a German shepherd and a tabby cat, the diversity would seem to depend on what classification one chooses to apply. Is this a collection of two felines and two canines? Or perhaps one Labrador, one tiger, one German shepherd and a tabby cat. Four mammals? Adding to these challenges is the inherent problem of defining categories in general. While the classical taxonomic ranks introduced by Linnaeus have been improved upon and remain a staple of classification, they are still arbitrary and fuzzy categories. The boundaries for such categories are even less clear for microbes. Fortunately alternative classification methods based on operational taxonomic units and phylogenetic distance do exist (Kembel, Eisen, Pollard, & Green, 2011).

In addition to measuring these kinds of diversity, diversity of type, one could also choose to measure the diversity of function. While many biologists are interested in

(14)

the evolutionary history, others are more interested in how communities function. The interests are of course not mutually exclusive. One could argue that the role a species plays in its community is of greater interest than its genetic history, at least if one is interested in the stability and robustness of an ecosystem.

All this leads to difficulties if one wishes to compare diversity across studies and methodology. And, as difficult as these questions are when one is considering plants and animals, they are doubly so when one is working with the microscopic world. One can of course also consider the diversity of other biological entities such as genes and shorter genetic elements likek-mers. For studying the diversity of microbial communities, diversity of k-mers has, a priori, some very enticing advantages compared to diversity measures that depend on taxonomy or similar categorical structures. These categories are not bounded, in the sense that there is no upper limit to the possible types of species, genes or functions. For any given k however, we know the exact number of possible k- mer. This gives a sense of absoluteness to k-mer based diversity measures, and diversity based on k-mers seem less arbitrary than measures based on categories. Any possible combination of k-mers can be mapped to a point in a closed subset of the unit sphere in in the 2^k dimensional space of k-mers. This is true whether the k-mer stem from community samples, the genome of some species or any other set of genetic sequences.

Thus, any community, species or organism, can be mapped onto a point of the hyper sphere ink-mer space. In a sense, thek-mer frequencies can function as a fingerprints for biological entities, and diversity comparisons across studies would have a more natural interpretation. This geometric view of diversity is, at least to me, more mathematically intuitive when considering how different two communities are, though it is less intuitive biologically.

The most enticing reasons for looking atk-mer diversity however, are practical. Firstly, when one uses the entire set of sequence reads, unknown species contribute as much to the diversity as known species. This is particularly useful when studying communities that are less well known, where the majority of species might be unknown or not present in any assembly databases. Horizontal gene transfer, mutability and in general the plasticity of microorganisms, imply that changes in species composition is not the only way in which communities can diversify, and k-mer based diversity measures might better be able to detect such effects.

Secondly, the cost and computational challenges are minor compared to those incurred by traditional methods, particularly if one is employing de novo assembly to ameliorate the effects of sparsely populated sequence databases. In short, ak-mer based approach to diversity allows one to use a larger proportion of the raw data at a reduced computational cost.

There are also several downsides with k-mers. Firstly, it is more abstracted from our intuitive concept of diversity. It is not obvious what it means that a community is diverse

(15)

in k-mers, nor is it obvious that it should be interesting. At best, it would seem, k-mer diversity is a proxy for what we are really interested in: the diversity of various taxa.

Thus, there is a problem in interpreting the results obtained by comparing k-mer counts.

For a pair of diverse communities it can potentially show that something is different, but just what constitutes that difference, in an ecological sense, is not clear. Secondly, while any community maps to a unique point in k-mer space, it does not map uniquely. That is, the map is not one to one. Consider, for instance, the short sequences AT AAT and T AAT A. These sequences have identical 3-mer counts: T AA, AAT, AT A. This means that two communities with very different species compositions might potentially have a very small β diversity when looking at k-mer frequencies. Thirdly, there is no obvious way to assess k-mer abundancy, as this would seem to depend solely on sampling effort.

Sequencing bias is another potential challenge. Sequencing techniques suffer from various biases related to primer sensitivity and DNA extraction methods (Martin-Laurent et al., 2001). In general, any bias introduced prior to analysis would be as much of a problem for k-mer analysis as for other analytic methods based on sequencing data. So, while k-mers are sensitive to unknown species, they cannot detect species that failed to be sampled in the first place.

The first of these concerns can be somewhat alleviated, as the k-mer diversity seem to correlate well with classical concepts. One study by Dubinkina et al. evaluated the correlation between beta diversity based on shortk-mers, five to thirteen nucleotides long, and more traditional methods. For simulated genomes the correlation was high (0.88, p 0.001) while for real metagenomes correlation varied between 0.43 and 0.8 (Dubinkina et al., 2016). The variable and somewhat low correlation observed when studying actual meta-genomes might indicate that this k-mer strategy is somehow lacking, but it might also be due to the k-mer based analysis picking up signals from species that have yet to be categorized.

That thek-mer diversity should correlate so well with species based diversity measures is perhaps not obvious, but neither is it entirely unexpected. Related genomes are more likely to be similar, and similar genomes are more likely to yield similark-mer frequencies.

That there should be correlation with functional diversity however, seems much less plausible. Surprisingly the aforementioned study found that the correlation with functional diversity was even higher.

The problem of interpreting the result from a comparison of k-mer frequencies, should not dissuade us from employing the method. Other methodologies facing similar difficulties in interpretation has, non the less, proved to be useful tools (Marzorati, Wittebolle, Boon, Daffonchio, & Verstraete, 2008). One technique uses denaturing gradient gel elec- trophoresis (DGGE), whereby DNA sequences can be separated by their melting point along a concentration gradient of some denaturing agent. Communities can then be compared by such DGGE fingerprints (Torsvik, Goksøyr, & Daae, 1990) Another, rather

(16)

similar, approach is based on the buoyant density of DNA (Holben & Harris, 1995). Both the buoyant density and the melting point of DNA sequences, depend to a large degree on its GC-content. And though it is usually not referred to as such, GC content is just a 1-mer frequency. It could be argued that the comparison of k-mer frequencies between communities has, by proxy, already been employed to determine compositional changes within microbial communities. The main point is however, that quantifying the degree to which communities differ can be of interest, even if the distance measure itself is not of immediate interest.

The second challenge is hopefully not too severe, due to the vast number of k-mers when k is sufficiently large. The probability that two communities that are markedly different in species composition should have very similark-mer distributions therefore seems, a priori, to be low. The third problem, obtaining meaningful measures of total abundance, could be examined through other means thank-mer analysis. Total abundances could be potentially be approximated using biomass estimations or with techniques such as flow cytometry or spiking with known species (Tkacz, Hortala, & Poole, 2018).

As noted in the beginning of this section: We all have an intuitive notion of what diversity is. And the subject is obviously of interest to many biologists. To study a subject rigorously, precise definitions are important, indeed I would argue a prerequisite.

Scientists therefore create operational definitions for such everyday terms as diversity.

Unfortunately, it is a common thing to see these operational definitions, after incubating in the field for a number of year, reemerge as putatively true definitions of the original concept. All academic fields seem to fall into this trap. This does in no way mean that such definition should be avoided, as they can be very useful. But, we should remember that such operational definitions are tools, proxies for the original concept, not a replacement.

And, while I suspect that the concept ofk-mer diversity will feel alien to many ecologists, I believe it does capture much of what is the central theme for our intuitive concept of diversity: The variety of life. Whether or not one is open to the idea of k-mer diversity itself as an interesting concept, it can still be embraced as a tool, if it can be shown to have predictive or descriptive power.

2.3 k-mer size, sample size and dissimilarity measures

Both sample size and k-mer size demand trade offs between accuracy and computational costs, and there is a noteworthy interaction between the two. Particularly the demands on memory rise rapidly with k and the number of reads to be analyzed. This is also the case for disc space, if the counts are to be stored. For a givenk, the memory requirements of a k-mer count table are more or less saturated when all unique k-mers have been encountered. Any increase in memory requirements from that point on, would be due to a need to change the integer type of the count table. Going from 16 bit integer counts

(17)

to 32 bit, for instance. For small k, when most k-mers are encountered after few reads, increasing the sample size further is not very costly. For large k, new uniquek-mers will still be encountered at larger sample sizes. This has two implications: The cost of a large sample size is much higher for largek, and if one wishes to obtain a representative sample, the need for larger sample sizes is greater for large k. Indeed, the rate of increase in the memory needed for the count table is likely a good proxy for whether or not the sample size is sufficient to obtain a representative sample.

The choice of appropriate dissimilarity measure may also depend on the size of k. For ak-mer length of five, the k-mer space is filled out for most samples. Similarity based on presence/absence data would therefore not give any meaningful measure. As k increases, so does the probability of detecting unique or rarek-mers. A shift towards measures that are more sensitive to differences in the presence and absence of k-mers, might therefore be more appropriate for large k. There is also a difference in the computational cost of various dissimilarity indices. Low cost alternatives, such as Bray-Curtis or cosine dissimilarity, might be preferable to the more computationally expensive measures such as the Jensen-Shannon distance.

Three different diversity measures were used in this thesis. Firstly the Bray-Curtis dissimilarity (Bray & Curtis, 1957). This is a much used dissimilarity measure in ecology and its applicability to k-mer counts was shown, in the previously mentioned study by Dubinkina, to correlate well with traditional diversity measures. For communities A and B with corresponding count vectors a and b, the Bray-Curtis index is given by:

BC(a, b) =

2^k

X

i=1

|aⁱ−bⁱ|

|aⁱ+bⁱ|

It should be noted that the Bray-Curtis dissimilarity measure is usually based on the count vectors and as such is sensitive to the differences in total abundance between sites and variations in sampling effort. Ask-mer abundance is, in practice, impossible to determine, the relative abundances should be used for k-mer analysis. One reason the Bray-Curtis dissimilarity is popular in ecology is that it integrates presence-absence data and the total abundances of each community. When k is small, all k-mers tend to be present and so the sensitivity to presence-absence is irrelevant. As the count for individual k-mers are normalized by total k-mer count, the sensitivity to total abundance also becomes irrelevant. The good news is that when based on frequency vectors it is a proper metric.

Indeed, when thus normalized, the Bray-Curtis dissimilarity is equivalent to the not so popular L₁ metric. The L₁ metric, often referred to as the Manhattan distance or city block distance, is given by:

L₁(a, b) =

2^k

X

i=1

|aⁱ−bⁱ|

The Manhattan distance is often regarded as an inappropriate measure in community

(18)

ecology. This is due to the so called double zero problem. If some species is equally abundant in two communities, the Manhattan distance treats this case in the same way whether the abundance is zero in both communities or both communities have the same non zero abundance. This is not a concern here however, as this problem only applies to counts, not to frequencies.

The second measure used was the Jensen-Shannon distance (Endres & Schindelin, 2003), a metric based on the Jensen-Shannon divergence (Topsoe, 2000). This metric was originally proposed as a distance measure for comparing probability distributions. It has since been co opted by biologists, and employed to measure β diversity. The vectors of k-mer frequencies can be viewed as probability vectors, where the probability of drawing a particular k-mer is estimated by its relative abundance in the sample. It would seem then, that this metric is a good candidate for the analysis. For communities A and B with corresponding frequency vectors a and b, the Jensen-Shannon distance is given by:

JS(a, b) =

rD(akm) +D(bkm)

2 ,

where,

m = a+b 2 and, for frequency vector x and y,

D(xky) = X

i

xⁱlog xⁱ

yⁱ

, is the Kullback-Leibler divergence (Kullback & Leibler, 1951).

The third dissimilarity measure used was angular distance. This is a proper distance measure, like the Jensen-Shannon distance, and adheres to the triangle inequality. While, as far as I can tell, this is not used in ecological studies, a related measure based on the cosine similarity has been applied to determine phylogeny based on protein sequences and their associated k-mers (Stuart, Moffett, & Baker, 2002). Another related measure is the chord distance introduced by Orl´oci (Orloci, 1967). Measures based on cosine similarity is common in natural language processing. It is used to indicate the similarity of two texts, for instance in order to detect plagiarism, as in (Shahmirzadi, Lugowski, & Younge, 2018).

One can think of reproduction as a form of plagiarism. As such it seemed like a potentially useful metric. This measure is also easier to conceptualize. It is just the normalized angle between the frequency vectors, and it can be calculated from the cosine similarity. The cosine similarity of vectors a and b, is given by:

CS(a, b) = P

iaⁱbⁱ

|a||b| . The angular distance is then:

AD(a, b) = 2· arccos(CS(a, b))

π ,

(19)

where the 2 rescales the possible distances among purely positive vectors to the interval [0,1]. I chose to use the angular distance rather than the more common cosine dissimilarity for mainly two reasons. First of all, as noted, it is a proper metric and so the triangle inequality holds. This is usually not considered important for diversity indices, but it seems to me that it could affect ordination results. The triangle inequality is also a tacit assumption underlying much of my intuition regarding differences. The angular distance also differs from the cosine dissimilarity in that it varies evenly. Both the cosine dissimilarity and the angular distance measure an angle, but the cosine dissimilarity changes slowly at dissimilarities close to 0 or 1, while it changes rapidly in the mid region.

This makes the dissimilarity less intuitive and it seems plausible that it could affect ordination methods.

For a dissimilarity measure to be of any use, it must necessarily be able to differentiate between communities. Consider two microbial communitiesAandB. Suppose, then, that A is the set of all genomes in communityA andB is the set of all genomes in community B. Counting thek-mers inA andB would, after normalization by totalk-mers, yield the respective relative abundances, or k-mer frequencies, a and b.

Let us consider some dissimilarity measure D(A, B), whereD7→[0,1], with 1 indicating maximal dissimilarity, that is, no k-mer overlap, and 0 meaning the k-mer frequencies are identical. If one were to draw samples from a community, and were able to measure the distance to the idealized community vector, the distance should be small. That is, the measure D(a, aⁱ) should ideally be close to zero. Similarly, for samples bⁱ taken from community B, the distanceD(a, bⁱ) should be close toD(a, b). What is considered close would depend on the measure, but to discriminate between the samples, the following must hold:

D(a, bⁱ)−D(a, aⁱ)>0.

It is not the absolute difference that is most important here. If there is uncertainty in the measurement, say the standard deviations for D(a, aⁱ) and D(a, bⁱ) are σ^a and σ^b respectively, then to reliably differentiate the two communities we must have

D(a, bⁱ)−D(a, aⁱ)−x(σ^a+σ^b)>0,

wherexdepends on what exactly is considered reliably. I will refer to a distance measures ability to discriminate communities reliably as discriminatory capacity.

2.4 Evaluating the viability of k-mer diversity as a basis for ecological analysis

It is not immediately obvious how one should go about evaluating the success or failure of a k-mer based approach when analysing microbial communities. One could directly compare the dissimilarities obtained from k-mer analysis with dissimilarities acquired through traditional approaches. The Mantel test (Mantel, 1967) and Procrustes test

(20)

(Jackson, 1995) can be used to test the correlation between distance matrices. The problem with this approach is that, barring perfect agreement between the matrices, there is no guarantee that the downstream analysis will be in agreement, even if the distance matrices are highly correlated. As such, high correlation on the Mantel and Procrustes tests can be seen as a necessary, but not sufficient requirement to deem the k-mer approach a success.

One could also compare the results from complete ecological analysis based on the different dissimilarity matrices. The main challenge with this approach is interpreting any disagreement among the methods. If both methods are in agreement, this would give credence to the k-mer based analysis. If, on the other hand, the results disagree, we cannot conclude that the results of the k-mer analysis is wrong in any strict sense.

Such a test is therefore, by itself, a rather weak test, as it can only lend support to the method. One solution to this problem, is to do a controlled test. This could be done by creating mock microbial communities and analyse these using both k-mer based and classical methods. In such a simulation one knows what species are present and what the outcome of the analysis should be. This approach was used in the cited study by Dubinkina (Dubinkina et al., 2016) to compare dissimilarities obtained using k-mers and dissimilarities obtained from reference based methods. Using mock communities was considered for this project, but simulating sufficiently complex communities, with realistic environmental dependencies, was deemed to difficult.

Another criterium one could use for evaluation, is whether or not the results agree with ecological theory. This is also, on its own, a rather weak test. There are very few clear cut answers in microbial ecology. There is quite a bit of interpretation involved in the analysis of microbial communities. Composing a plausible explanation after the fact is usually possible for any significant result observed. There are obviously many factors that can influence how microbial communities change from one area to another, or indeed from one time to another, and there does not seem to be a consensus on what are the primary drivers of diversity.

One way to minimize this problem is to look at low complexity questions were the underlying theory is less contentious. For instance, one could compare microbial communities from very different environments, such as soil and pelagic communities. If the method cannot differentiate these communities, it would be a rather harsh blow against it.

In order to evaluate the viability of using k-mer based diversity measures as a basis for ecological analysis, two different data sets have been examined in this thesis. The first of these datasets was analysed in parallel with an ongoing study employing classical ecological methods. This allows for a direct comparison of the dissimilarities obtained via k-mer analysis and the dissimilarities based on count tables obtained from the aforementioned study. Additionally, a further comparison of the results from the downstream analysis

(21)

can be made.

The second dataset is a collection of 389 metagenomic samples collected from the Pacific Ocean and the coast of Australia. These samples are a mix of pelagic, coastal and sediment samples. These groups of samples should, a priori, differ quite a bit in community composition. As such, this data set provides an opportunity to test if k-mer based diversity measures can distinguish samples originating from different systems. It also serves to demonstrate the scalability of a k-mer based approach.

3 Methods

3.1 Initial evaluation of dissimilarity indices

In the early stages of this project, I wanted to determine what would be a sufficient number of reads for a givenk-mer length, and whether or not smaller values of k (k <11) could differentiate between samples. I also wanted to test different dissimilarity measures, and determine the interactions between various choices of k-mer size, sample size and dissimilarity measure.

To determine whatk-mer length and sample size would be sufficient to differentiate the samples, the discriminatory capacity of the three dissimilarity measures were examined for a range ofk-mer lengths and sample sizes. We never have access to the actual community vectors, but it is reasonable to assume that samples of sufficient sequencing depth will be good approximations, only limited in accuracy by sampling techniques and sequencing technology. Two samples, SV001 and SV002, were selected from the Svalbard data set to serve as stand ins for actual community vectors. From sample SV001, twenty two sub samples were randomly drawn for each of the following sample sizes: 10k, 20k, 50k, 100k, 200k, 400k, 800k and 1600k reads.

Half of the sub samples were drawn from the R1 strands and half drawn from the R2 strands using the following algorithm: Let N be the sample size, X the strand number and R be the number of reads for sample SV001. For each sub sample, a set of numbers random numbers, {nⁱ}, were drawn, without replacement, in the range [1, R]. For each nⁱ, read number nⁱ was then added to the sub sample. K-mers were then counted for each sub sample using a self written python script.

The k-mers for the entire metagenomes for samples SV001 and SV002 were counted using Jellyfish (Mar¸cais & Kingsford, 2011). Ideally the same software should have been used in both instances, but my code was poorly optimized for the large data sets and integration of Jellyfish into the scripts for drawing, organizing and counting the sub samples proved difficult. K-mers were counted fork = 5,7,9,11 and 13. For each sub sample the angular distance, Jensen-Shannon distance and Bray-Curtis dissimilarity to SV001 and SV002 were calculated. Thus, for each sub sample, SS, and each dissimilarity mea-

(22)

sure,Dⁱ, we have two dissimilaritiesDⁱ(SS, SV001) and Dⁱ(SS, SV002). These distances were examined to give an idea of how the within sample distances and between sample distances are affected by sample size, k-mer length and choice of dissimilarity measure.

3.2 The Svalbard dataset

3.2.1 Data acquisition and preprocessing

The Svalbard data set examined in this thesis is part of an ongoing study focusing on microbial succession in the wake of glacial retreat. Of the original 30 samples, only the 25 samples from Svalbard were studied in this thesis. The original sample set was gathered from a collection of 30 lakes, 25 of which were located at Svalbard (Kungsfjorden and Longyearbyen), with the remaining 5 located in mainland Norway (Finse). The 30 samples constitute a total of five glacial chronosequences. Lakes were selected with minimal anthropogenic influence and the distances to the glaciers spanned a range from 0 to 10 km. Samples were taken during summer 2019 (8th to 11th of August at Svalbard and 23rd of August at Finse). Some study sites (SV001-003, and SV007) were close to human settlements and highly impacted by birds.

For DNA sequencing, between 0.2 and 1.2 liters of water was filtered through 0.2µm Sterivex cartridges (Millipore), in duplicate, by using sterile syringes followed by on-site liquid nitrogen freezing until further analyses. Total DNA was extracted using a DNesay PowerWater Sterivex Kit (Qiagen, Germany).

Libraries were prepared from 20 nginput DNA using the Nextera DNA Flex reagents (Illumina) according to manufacturer’s instructions, with 8 cycles PCR and unique dual index adapters. Library concentrations were determined by quantitative PCR using KAPA reagents (Roche), and a single equimolar pool of all libraries prepared. Sam- ples F I016 and SV011 did not give any libraries with acceptable quality, and thus these samples were excluded from metagenome sequencing. Sequencing was performed on an Illumina NovaSeq S4 flowcell with 150 bp paired end reads.

The environmental variables pH and conductivity were excluded from the analysis, even though both of these factors are assumed to to have a significant impact on community composition. For the samples gathered near Longyearbyen, samplesSV001, SV002, SV003 and SV004, data was lacking for pH and conductivity. As the number of sites sampled was only 25, the variables were ignored and the sites kept in the study. For the samples with available data the conductivity was on the low end (<500) except for sampleSV020 which had a conductivity of (1410). The pH values for all lakes fell in the range 7.5 to 9.0.

Measurements for both total and dissolved phosphorous were not sensitive enough to discern differences below 2µg/L. Values below this cutoff were set to 0.01 so normalization and regression methods could be applied. The measurements for total nitrogen (T N)

(23)

showed some anomalies as for some sites the dissolved nitrogen (DN) registered higher values than than T N.

The preprocessing was done separately, and by different people, for thek-mer analysis and the alignment based analysis. The KEGG orthology method used the following pro- cedure: For the standard metagenomic analysis a fully automatized pipeline SqueezeMeta (Tamames & Puente-S´anchez, 2019) was used, including quality filtering using Trimmo- matic (Bolger, Lohse, & Usadel, 2014), individual sample read assembly with Megahit(Li, Liu, Luo, Sadakane, & Lam, 2015) and mapping with Bowtie2 (Langmead & Salzberg, 2012). Annotations were performed against COG (Tatusov, Galperin, Natale, & Koonin, 2000), KEGG (Kanehisa, Sato, Kawashima, Furumichi, & Tanabe, 2016) and Pfam (Finn et al., 2014) databases.

For the k-mer analysis, the following preprocessing was performed: The read quality for all samples were inspected using FastQC (Andrews et al., 2010) and the overall read quality was high, with average quality per read over 35 for all samples. All reads were subjected to trimming. Nextera adapter sequences were removed and reads with quality below 30 dropped using the software Cutadapt/trimgalore (Martin, 2011). After trimming there were still some over represented k-mers in the first 10 positions. I suspect these are artefacts due to random priming during shotgun-sequencing. These were not removed due to time constraint. An ad hoc test for the potential interference of these suspected artefacts was performed. The sub samples used to assess discriminatory capacity had their k-mers recounted, starting at base pair 11. The distances obtained from these new counts were then compared with the old distances. The comparison of distance measures based onk-mer counts with a 10bpoffset with distances measured fromk-mer counts with no offset showed no discernible difference.

The chemical environmental data total organic carbon (T OC), dissolved organic carbon (DOC), total phosphorous (T P), dissolved phosphorous (DP), total nitrogen (T N) and dissolved nitrogen (DN) were log transformed prior to analyses, as suggested in (Palmer, 1993). The remaining environmental variables (temperature, altitude, distance to glacier) were not transformed. K-mers for the full metagenomes were counted using Jellyfish (Mar¸cais & Kingsford, 2011). Self written python scripts were used to draw sub samples, count the k-mers for the sub samples and to calculate k-mer based distances.

K-mers counted by Jellyfish were converted to a local format and all k-mers containing failed base calls (N) were removed prior to distance calculation.

3.2.2 Data analysis

One common approach to analysing community structure is through ordination methods.

After calculating the distances between samples from their k-mer counts, it seemed like the best place to start was with a distance based redundancy analysis (dbRDA), a con-

(24)

strained ordination method introduced by Pierre Legendre and Marti Anderson (Legendre

& Anderson, 1999). In short this method is a combination of (multiple) linear regression and ordination. The dissimilarity matrix is first subjected to a principle coordinate analysis (Gower, 1966) and the coordinates of the samples in the space of principle coordinates are then used as the dependent variables in a linear regression on selected environmental variables.

The regression model was built with the following independent variables: dissolved nitrogen (DN), total nitrogen (T N), dissolved phosphorous (DP), total phosphorous (T P), dissolved organic carbon (DOC), total organic carbon (T OC), altitude, temperature and distance to glacial front. The models were tested for possible multicollinearity using vif.cca from the R package Vegan (Oksanen et al., 2008). The highest variance inflation factor reported was for dissolved nitrogen (V IF = 6.15) indicating some degree of collinearity, but not to a severe degree.

Following constrained ordination, the ordination results were subjected to permutation based analysis of variance (Legendre, Oksanen, & ter Braak, 2011). Significant (p <0.05) results of the ANOVA test would then prompt further analysis to find the contribution of individual environmental variables. This analysis was initially carried out for k= 11 and k = 13 using the full data set. The analysis was then repeated on sub samples of 100k reads from each of the samples for k = 5,7,9.

In addition to the constrained ordination, an unconstrained ordination using non metric multidimensional scaling was performed. This was done using metaMDS from the R package Vegan. The environmental data was then fitted to the resulting ordinations using envfit, also from Vegan. The analysis was run for the Jensen-Shannon distance, Bray-Curtis dissimilarity and angular distance based on 11-mers and 13-mers using the full set of reads.

In order to assess the results from the k-mer analysis it is important to have some reference to compare it against. The samples analyzed in this thesis were, as noted, part of an ongoing metagenomics project. While this project was not completed at the time of writing, some preliminary results were available. The sequencing reads had been mapped to and grouped by KEGG orthology classes (Kanehisa et al., 2016), and the number of transcripts per million reads (TPM) for each sample were available. The TPM gives, for each feature, the number of reads mapped to that feature in a random sub sample of one million reads. The term TPM is somewhat of a misnomer in this case, as we are not counting transcripts. The term has its origin in RNA sequencing, but the method can just as well be employed in DNA sequencing. The reads had also been mapped to NCBI databases and classified by taxonomic ranks, and the number of reads mapped to each taxa detected were available.

Non metric multidimensional scaling, using the R package metaMDS, was performed on the orthology TPM table using defaults settings. The ordination was thus based on

(25)

the Bray-Curtis distance. The environmental variables were then fitted to the ordination result using envfit. Both algorithms are contained in the R package Vegan(Oksanen et al., 2008). As for the initial k-mer NMDS and dbRDA, the environmental variables included were TN, DN, TP, DP, TOC, DOC, temperature, altitude and distance to glacier.

The taxonomic count table, based on mapping reads to NCBI databases, was sorted into three new tables. One table for each of the taxonomic ranks domain, order and genus.

For each rank, reads that were mapped to viruses or failed to map to any NCBI entries at the given rank, were discarded. On average 67% of reads were discarded for each sample at the domain rank, 83% at order rank and 91% at genus rank. These data tables were then subjected to redundancy analysis (van den Wollenberg, 1977) on the environmental variables. This was done using the rda function from the Vegan package

Fitting the environmental data to the NMDS ordination could just as well have been based on the taxonomy data. Likewise, the RDA analysis could have been performed using the KEGG table. In order to reduce the workload, the two approaches were split between the orthology and taxonomy based mapping results, rather than performing both analysis on both data sets.

Dissimilarities obtained via KEGG orthology or NCBI mapping were also compared directly with the k-mer based dissimilarities. In order to make a direct comparison, the Bray-Curtis dissimilarity was calculated between each sample pair, based on the number of reads (TPM) mapped to each orthology class. Bray-Curtis dissimilarities were also calculated based on count of taxa at the domain, order and genus ranks. The result were compared with dissimilarity measures based onk-mer counts using the Mantel test (Mantel, 1967). Comparisons were made for both Jensen-Shannon distances and Bray-Curtis dissimilarities, calculated from the 11-mer counts of the metagenomes. An additional comparison was done using a Procrustes test (Jackson, 1995).

3.3 Dataset from Australian Microbiome Initiative

3.3.1 Data aquisition and pre processing

The samples from the Ausstralian Microbiome Initiative were included to give a more straight forward test of the viability of using k-mer based diversity measures to answer ecological questions. Can the method differentiate between samples from notably different environments?

In order to answer this question a total of 389 metegenomic samples were downloaded through Bioplatforms Australia and The Australian Microbiome Initiative. The collection consists of 20 samples from coastal water, 20 sediment samples and 349 samples from pelagic zones. The sediment and coastal water samples were all taken collected near Sydney and Perth while the pelagic samples where collected from a range of locations near the coast of Australia and the Pacific ocean. While the precise sampling protocol

(26)

was not available, all samples included were reported to us the same library construction protocol Illumina DNA Prep (previously Nextera DNA Flex Library Prep).

The database was initially searched manually, looking for metagenomic samples from different environments and appropriate keywords to use in the planned programmatic search. The manual inspection of the database led to the choice of basing the initial search on the term ”data type:MGE”.

A self written python script was then used to search the database and automatically download metagenomes, trim the metagenomes, countk-mers and calculate distance matrices. Trimming was done via Cutadapt/trimgalore (Martin, 2011) and k-mers counted Jellyfish (Mar¸cais & Kingsford, 2011).

While the number of samples available through Bioplatforms Australia is large, the annotation of the samples is not standardized across sampling initiatives. Unfortunately this contributed to the number of samples found for each sample type (pelagic, sediment, coastal water) being heavily skewed, with roughly 90 percent of the samples being pelagic. Attempts were made to find more non pelagic samples, but samples either lacked information on sampling environment, were inaccessible or based on different sequencing techniques such as 16S amplification.

3.3.2 Data analysis

The angular distance, Jensen-Shannon distance and the Bray-Curtis dissimilarity was calculated between each pair of the 389 samples. The distance matrices obtained were used as basis for a principal coordinate analysis (Gower, 1966) using cmdscale from the stats package in R (R Core Team, 2021). To determine if there was a significant difference between the groups a permutational ANOVA was performed. The permutational ANOVA was performed using adonis from the Vegan package in R (Oksanen et al., 2008).

3.4 Evenness of k-mers

While the main focus of this thesis is onβ-diversity, one can also consider various measures of alpha diversity ofk-mers. There are three types of alpha diversity one usually consider.

Richness, total abundance and evenness. As noted earlier, it is not feasible to get an accurate measure for the total abundances for k-mers, as it is a mainly governed by sampling effort. The k-mer richness might be of interest for longer k-mers, but for the short k-mers used in this analysis, the entirek-mer space is occupied (k <= 11). Evenness on the other hand, can be assessed.

One can think about evenness as the degree to which a community deviates from the community structure of a perfectly even community. Due to the fact that the number of k-mers are bounded for a given k, the concept of a perfectly even community makes sense and has a natural definition: A perfectly even community has ak-mer frequency of ₄²^k for

(27)

each k-mer. Thus, a measure of evenness could be based on the beta diversity between a sample and such an idealized community. As the k-mer frequencies can be viewed as point on a hypersphere, much of this thesis is focused on Svalbard, it seems apt to refer to this theoretical, perfectly even community as a pole. The beta diversity, D , between the pole and a sample is thus a measure of dominance, and 1−D is then a measure of evenness.

Five indices of k-mer evenness were calculated for the samples from Svalbard: Bray- Curtis dissimilarity with the pole, the Jensen-Shannon distance to the pole, angular distance to the pole, Pielou’s index and Simpson’s evenness index.

To see if the evenness of k-mers have an immediate biological interpretation, the k-mer indices were compared with Pielou’s evenness index calculated from taxonomic classification of reads.

4 Results

4.1 Discriminatory capacity

There were some notable differences between the different dissimilarity measures. The angular distance seems to converge more rapidly than both the Jensen-Shannon distance and the Bray-Curtis dissimilarity The angular distance also showed a larger discriminatory capacity in absolute terms. The variance was however rather high and grew with increasingk-mer length. The Bray-Curtis dissimilarity and Jensen-Shannon distance had substantially lower variance compared with the angular distance. And, when measured in standard deviations, the separation of the distances is more than 30 times greater for the Bray-Curtis and Jensen-Shannon measures (sample size 100k). The angular distance might be a comparatively poor choice of distance measure for longer k-mers. The results are summarized in table 3 and figure 1.

When comparing distances calculated from the same strand, the variability displayed by the angular distance was reduced considerably, as can be seen in figure 2 and table 3. Sub samples from the same strand were then compared directly, showing very little difference between the sub samples. The angular distance was (0.050), Jensen-Shannon distance 0.050 and Bray-Curtis dissimilarity 0.052, calculated on two sub samples of 1600000 reads. Distances calculated between strands yielded an angular distance of 0.369, Jensen-Shannon distance of 0.052 and a Bray-Curtis dissimilarity of 0.054.

This is unexpected, as there is no biologically relevant directionality inherent in the reads, andk-mers were registered as the lexicographically smallest of the complementary kmers. I can only think of two reasons for the observed behavior. Either there is some error in the code used to draw sub samples or calculate the distances, or there is some bias in adaptor ligation or due to the non complementary primers used in the sequencing.

(28)

That only the angular distance seem to differentiate between the strands, might be an indication that there is some error in the calculations.

Table 3: Discriminatory capacity. The table shows statistics based on distances measured between the k-mer frequencies of random sub samples drawn from SV001 and full metagenomes of both sample SV001 and SV002. Each sub sample consisted of 100kreads and k-mer length was k = 11. Top two rows shows the average and standard deviation of distances between sub samples from SV001 and the whole metagenome of SV001. In rows three and four the sub samples from SV001 were compared to the metagenome of SV002.

Row five gives an indication of discriminatory capacity in absolute separation between the samples. Row six gives the separation in number of standard deviations. The lower part of the table shows the same information based on sub samples from strand R1 only.

Angular D. Jensen-S. Bray-C. Sample pair Avg. Dist 0.22048 0.15251 0.14958 SV001

Std 0.04538 0.00017 0.00017 SV001

Avg. Dist 0.44335 0.22556 0.23595 SV001

Std 0.05484 0.00049 0.00061 SV002

Disc 0.22287 0.07305 0.08636

DiscStd 4.064 149.4 141.6

Avg. Dist 0.26240 0.15250 0.14956 SV001

Std 0.01240 0.00014 0.00011 SV001

Avg. Dist 0.49640 0.22576 0.23621 SV001

Std 0.00985 0.00050 0.00061 SV002

Disc 0.234 0.0731 0.0864

DiscStd 18.86 145.7 141.5

(29)

(30)

(31)

4.2 The Svalbard dataset

4.2.1 Distance based redundancy analysis

For k = 11 and using the angular distance metric, the results of the redundancy analysis showed no significant linear dependency (F = 1.28, P r(> F) = 0.157) between the explanatory variables and the ordination axis. Further analysis was therefore aborted.

Using the Bray-Curtis dissimilarity measure and k = 11, the ANOVA result indicated a significant model (F = 1.61, P r(> F) = 0.007). The proportion of inertia explained by the constrained axes was reported to be 0.49, where inertia is the sum of squared distances in the ordination space. This prompted further ANOVA tests for each explanatory variable. The most significant factors appear to be DOC , (F = 2.68, P r(> F) = 0.026), T N(F = 2.16, P r(> F) = 0.034), DN(F = 2.38, P r(> F) = 0.021), distance to glacier F = 2.11, P r(> F) = 0.037).

The model based on the Jensen-Shannon distance, for k = 11, was highly significant (F = 1.53, P r(> F) = 0.004). Significant explanatory variables were DOC(F = 1.81, P r(> F) = 0.041), DN(F = 2.84, P r(> F) = 0.002) and distance to glacier (F = 1.83, P r(> F) = 0.042). Proportion of inertia due to canonical axes (0.48). Results of the permutation based, term wise ANOVA tests are summarized in table 4

The analysis was repeated based on 13-mers to assess whether or not an increase in k-mer size would yield different results. Using the angular distance, the model was not found to be significant (F = 1.14, P r(> F) = 0.294). The model was significant when using both the Jensen-Shannon distance and the Bray-Curtis dissimilarity, (F = 1.60, P r(> F) = 0.002) and (F = 1.64, P r(> F) = 0.002) respectively. Proportion of explained inertia for the Bray-Curtis dissimilarity was 0.50 and for the Jensen-Shannon distance it was 0.49. Both measures resulted in a highly significant results for DN, indeed, no more extreme F values were found for any of the 10000 permutations. There was also better agreement in regards to the remaining variables, with both measures reporting similar significance for all variables. Results of term wise ANOVA tests for k= 13 can be found in table 5.

The same approach was used in analyzing sub samples of 100000 reads from each lake, to test if smaller sample sizes would yield analogous results. The results were similar to the results obtained from the full set of reads, though with slightly higher significance: Bray- Curtis (P r(> F) = 0.004), Jensen-Shannon (P r(> F) = 0.003). The angular distance showed a marked higher significance level (P r(> F) = 0.052), but as the full data set showed no significance this should probably only serve as a warning to not put too much stock in one significant result. The preliminary tests on the effect of sample size and k-mer length for differentiating samples, see figure 1, indicate that sufficient sample size decreases with shorter k-mers. This, together with the congruence between the results obtained from the redundancy analysis of sub samples and full samples, suggest that a

Applications of k-mer count tables in community ecology