The transcriptome of Gadus morhua and its discordance with the genome

(1)

The transcriptome of Gadus morhua and its discordance with the genome

How does enrichment of short tandem repeats influence the alignment?

Kari Jerve Ramsøy

Master of Science Thesis

Genetics and developmental biology 60 credits

Department of Biosciences

The Faculty of Mathematics and Natural Sciences UNIVERSITY OF OSLO

September 2020

(2)

(3)

The transcriptome of Gadus morhua and its discordance with the genome

How does enrichment of short tandem repeats influence the

alignment?

(4)

Title: The transcriptome of Gadus morhua and its discordance with the genome Author: Kari Jerve Ramsøy

Supervisors: Ole Kristian Tørresen (IBV) Kjetill Sigurd Jakobsen (IBV) Eivind Hovig (IFI)

https://www.duo.uio.no

(5)

Abstract

The transcriptome assembly of Atlantic cod has previously been obtained in order to investigate the quality of the genome assembly and uncover genes that have not yet been annotated. However, it was observed that a considerable number of transcripts aligned partially to the genome assembly, and others, not at all. Further, the Atlantic cod genome is highly enriched with short tandem repeats (STRs), compared to most other vertebrates. STRs are known to be challenging to sequence correctly and difficult for mapping tools to align.

The aim of this thesis was to analyze transcripts in the transcriptome of Atlantic cod that were partially mapped and transcripts that showed no sequence similarity to the reference genome.

Two different mapping tools, minimap2 and GMAP, were used, in order to investigate which performed the best in aligning genomic data highly enriched in STRs. The density of STRs in partially mapped transcripts were obtained, and enrichment of specific biological processes among these transcripts were analyzed. Differences in the alignment of STRs between the two mapping tools were compared. It was searched for sequence similarity of some non-mapped transcripts to all sequenced species available, and specifically to other species of fish. Further, it was looked for synteny regions in Atlantic herring and Zebra fish, with the aim of localizing unannotated genes in the Atlantic cod genome with help of unmapped transcripts. The results show that partially mapped transcripts had a significantly higher density of STRs than

transcripts that had an almost complete alignment to the genome assembly. The number of base pairs in STRs likely affected if a transcript were aligned. Generally, both tools showed limitations in aligning repetitive segments, but minimap2 was considered to be the mapping tool that was most capable of aligning to genomic data enriched in STRs. Additionally, some transcripts with no sequence similarity to Atlantic cod showed sequence similarity to species of zooplankton. They could likely be contaminants in the sample before sequencing. Other transcripts showed sequence similarity to other species of fish, but not to Atlantic cod, which may indicate that these genes are missing from the Atlantic cod genome assembly. The nup155 gene was annotated in Zebrafish and Atlantic herring, but not in Atlantic cod. It was found synteny in the genes surrounding nup155 between the three genomes, therefore the annotation of this gene is suggested to be on chromosome 4 region 4:11,929,091 - 12,047,953 in the gadMor3 genome assembly of Atlantic cod. These findings are important when the transcriptome is used as a tool for further research on the genes in Atlantic cod. Furthermore, it is vital to be aware of the weaknesses these mapping tools show when aligning STRs, and this should be taken into account when using the transcriptome for annotating genes.

(6)

(7)

Acknowledgements

This thesis was written at the Center for Ecological and Evolutionary Synthesis (CEES), at the department of Biosciences, University of Oslo, under the supervision of Ole Kristian Tørresen, Kjetill Sigurd Jakobsen and Eivind Hovig.

I would like to thank my main supervisor Ole. Thank you for sharing all your knowledge, for always having a positive attitude and being available during the whole period whenever I had questions. I am truly grateful for all your help with the thesis! I would also like to thank my co-supervisors Kjetill and Eivind for giving me the opportunity to work on this project and for your valuable guidance along the way.

Thanks to fellow students and friends inside and outside of academia for interesting

conversations, coffee-breaks and fun times. Thanks to my mother for always believing in me.

To my brother and sister for being great role models, and my sister for a critical review of the thesis. I would also like to thank my dad. Even though you could not be here for this moment, I know you would have been very proud. You are my main inspiration, both academically and personally.

Last but not least, thanks to my son Jacob, for being the most wonderful little human, and to Omar, for being my cliff in life, and for all your patience throughout the whole period.

Kari Jerve Ramsøy, September 2020

(8)

(9)

1 Introduction

Transcriptome assemblies are often obtained together with genome assemblies when

evaluating the quality of the genome assembly and annotating genes. They are also important tools in many other analyses, as for example in analysis of differences in gene expression of different tissues (Moreno-Santillan, Machain-Williams, Hernandez-Montes, & Ortega, 2019).

When the transcriptome is assembled, and a high-quality reference genome assembly is available, one would expect most of the transcripts to map to the reference genome sequence, since the transcriptome represents the translated and spliced regions of the genome

(Freedman, Clamp, & Sackton, 2020). Transcripts that do not map, can indicate fragmentations of either the genome or transcriptome assembly.

When the transcriptome of Atlantic cod was assembled and mapped to the genome assembly of Atlantic cod, it was discovered that a great number of transcripts mapped only with parts of the sequence, and that the percentage of aligned bases per transcript varied greatly. The genome assembly of Atlantic cod have a very high density of short tandem repeats (STRs), both in coding and non-coding regions (Torresen et al., 2017). It is known that repetitive stretches of DNA and RNA can often be prone to errors at multiple levels during the sequencing and assembling processes (Torresen et al., 2019). This can lead to differences between corresponding transcripts and genome sequences on base pair level, which will cause problems when aligning the transcriptome and genome assemblies. Furthermore, different mapping tools can often give quite different results when aligning a given transcriptome to a genome assembly because their algorithms for alignment of sequences and resolving

mismatches work differently (Krizanovic, Echchiki, Roux, & Sikic, 2018). Hence, the choice of tools is crucial when preforming a transcriptome-genome alignment.

1.1 Atlantic cod

Atlantic cod (Gadus morhua) has been vital for settlements along the coast of Northern Europe and North America, and the cod fisheries date back centuries. Today, Atlantic cod is the most commercial and economic important wild caught fish species for the Norwegian society. In 2018 the value of catch was 6.9 billion NOK. Comparatively, the secondly ranked fish species, Atlantic mackerel, had a value of catch at 2.4 billion NOK (Baklien, 2019).

(12)

Figure 1: Red areas show the geographical distribution of Atlantic cod (FAO, 2020).

The population of Atlantic cod is widely distributed on both sides of the Atlantic Ocean. Its habitat stretches from Cape Hatteras to Ungava Bay along the North American coast, east and west coast of Greenland, around Iceland and around the coasts of Europe, from the Bay of Biscay to the Barents Sea, including the region around Bear Island (Bjørnøya) (Figure 1).It is mainly a demersal fish, but its habitat may become pelagic when feeding or spawning (FAO, 2020). The two major populations of Atlantic cod in Norwegian waters are the North East Arctic cod (NEAC), commonly known as “skrei”, and the Norwegian Coastal cod (NCC).

The NEAC population is the most important for Norwegian and worldwide fisheries of Atlantic cod (Star et al., 2011).

1.2 Previous research on the genome of Atlantic cod

In 2011, the first genome assembly of Atlantic cod was published (Star, Nederbragt et al.

2011). The assembly was obtained from high-throughput 454 pyrosequencing. This led to the discovery of some unique characteristics in its immune system. At that point, such

characteristics had not been found in any other vertebrates. The assembly showed lack of the genes MHCII, CD4 and Ii, which are genes that are essential for the pathway associated with the major histocompatibility complex (MHC) class II (Star et al., 2011).

(13)

sequencing. The long-reads obtained with PacBio made it possible to provide an assembly where the sequence contiguity was improved fifty-fold and the proportion bases in gaps had been reduced fifteen-fold (Torresen et al., 2017). The improved assembly showed a high density of tandem repeats (TR), also in exons and promotor regions, which were not so evident in the first assembly due to fragmentation. Heterozygous TRs in the sequenced individual was estimated, and analysis indicated that at least one-fifth of the TRs was heterozygous. Since the rate of heterozygosity in TR is so high in one single individual, it is assumed that this variation is also seen on a populational level. The high abundance of TRs, with an elevated mutation rate, in coding and regulatory regions of Atlantic cod does most likely have profound impact on the Atlantic cod population and their local adaptions, when seen in an evolutionary perspective (Torresen et al., 2017).

In 2019, a third genome assembly, even more improved than the second assembly, was generated. A second generation of the PacBio platform (Sequel 2), which generate longer reads, was used. In addition to this, 10XGenomics linked reads, Arima HiC and Bionano optical maps were used to further optimize the genome assembly. With 10XGenomics, Illumina short-reads are obtained, followed by linking of the reads by using molecular

barcodes to tag reads that come from the same DNA fragment.¹ Arima HiC also sequences the DNA sample with Illumina sequencing. Previous to this, a proximity ligation sample

preparation method detects spatially close DNA interactions in cells and uses chromosome folding properties to capture biological properties. This information can be used to assign contigs to chromosomes and determine their order and orientation along the chromosome.² Optical mapping provided by Bionano uses restriction-maps to generate images of long DNA molecules in their native state and shows the actual orientation of the genome. It is ideal to use in addition to sequencing methods for improvement of genome assemblies (Sousa et al., 2019). Literature addressing the latest genome assembly version have not yet been published, but it is assumed to be of higher quality then the previous assemblies, due to the more

advanced sequencing and assembling platforms that have been used.

1.3 Sequencing and assembling of the Atlantic cod transcriptome

The transcriptome of Atlantic cod was sequenced with Pacific Biosciences Single molecule, Real-time (SMRT) long-read isoform sequencing (Torresen et al., 2017). This sequencing

1 Information about 10XGenomics linked reads was retrieved from: https://www.10xgenomics.com/blog/a-basic- introduction-to-linked-reads

2 Information about Arima HiC was retrieved from: https://arimagenomics.com/assembly

(14)

method can generate full length transcripts, from 5’ to 3’end, which was not possible for short-read technologies, such as the Illumina platform (Zhao et al., 2019). With long-read technology one is more likely to generate reads that span trough whole repetitive regions.

When flanking regions on both sides of a repeat are successfully sequenced it is possible to assemble reads to its exact location. One is then more likely to obtain a full-length high- quality transcriptome. Long-read sequencing have therefore to some extent resolved one of the major problems with short-read sequencing (Treangen & Salzberg, 2011). In addition to assembling repetitive sequences, long-read sequencing methods has also minimalized the introduction of sequencing errors in the form of GC-bias, chimeric reads, and repeat variation length, which are often introduced during PCR amplification in the library preparations (Ardui, Ameur, Vermeesch, & Hestand, 2018).

1.4 Short tandem repeats (STRs)

Tandem repeats are found in DNA, RNA and amino acid sequences. Their pattern of base pairs is repeated in tandem, and they are put in different categories depending on the length of the repeating unit. Short tandem repeats (STRs), also known as microsatellites, consists of repeated units of one to nine base pairs. Those with a repeating unit of ten to several hundred base pairs are often called minisatellites (Duitama et al., 2014). Mono-, di-, tri-, and

tetranucleotides are most abundant (Ellegren, 2004). STRs also vary in number of iterations of the repeating unit (Mayer, Leese, & Tollrian, 2010). In this project, the focus was on STRs that occur in coding DNA and in RNA sequences. The Atlantic cod genome is highly

enriched with STRs, and the most abundant repeated unit length is di-repeats, which make up 48.7% of all annotated repeats in the second version of the Atlantic cod genome assembly (Torresen et al., 2017).

Tandem repeats and other repetitive regions of DNA were in many years labeled as

nonfunctional “junk” DNA and were often ignored. Repeat variation was often considered as neutral and did therefore not result in any phenotypic changes (Hartl, 2000). Today the view on repetitive regions in genomes and transcriptomes has changed, and we know that tandem repeat length variation may have a considerable effect on the expressed phenotype (Gymrek et al., 2016).

1.4.1 The mutation rates of short tandem repeats

(15)

genome (Gemayel, Vinces, Legendre, & Verstrepen, 2010; Torresen et al., 2017). They mutate by adding or removing full repeat units. The mutation rate varies among loci and among alleles, but the main factors that affect the mutation rate is the length and the purity of the STR. The mutation rate increases with an increasing number of repeat units (Ellegren, 2004), and with increased purity of the repeat (Gemayel, Cho, Boeynaems, & Verstrepen, 2012). However, due to the high variability in mutation rate among different STRs, these cannot be the only factors that determines this rate. Other factors, such as effect of flanking sequences, transcription-coupled repair, chromatin structure and local point-mutation-rate variation do likely also affect the mutation rate (Ellegren, 2004).

1.4.2 The mechanism behind STR expansion and contraction

Several mechanisms behind how STRs in the genome contract and expand are discussed in the literature. However, the precise molecular mechanism of repeat rearrangements is still somewhat unclear. Over the years of research on STRs, strand-slippage replication are the most favored theory for how length changes in STRs occur (Ellegren, 2000).

In the event of strand slippage replication, the repeat can either expand, contract or stay the same. The process starts with DNA polymerase traversing the DNA strand during replication, and the two DNA strands dissociates. At the location of a repetitive element, if a repeat on the newly synthesized strand incorrectly rehybridizes with a complementary repetitive sequence downstream on the strand, a loop will be formed on the new strand. The newly synthesized sequence will then be longer than the template sequence, and the repeat has expanded (Figure 2). Conversely, if the incorrect alignment occurs upstream along the template strand, the new strand will become shorter than the template sequence, and the repeat will contract (Figure 2) (Ellegren, 2000). However, most of the mutations are corrected by the mismatch-repair system, and only a small fraction ends up as an STR mutation event (Ellegren, 2004).

(16)

Figure 2: Replication slippage can lead to either expansion or contraction of the repeat, depending on where the loop is formed. If the loop is formed on the newly synthesized strand (upper strand in the figure), the repeat will expand. If the loop is formed on the template strand (lower strand in the figure), the repeat will contract (M. A.

Jobling, 2004).

Researchers have been arguing that other mechanisms related to recombination, such as unequal crossover and gene conversion, play a role in short tandem repeat rearrangements.

However, most tests show no correlations between STR mutability and recombination rate.

For example, there is no evidence for systematic differences in the rates and patterns of microsatellite mutations between autosomal and Y-linked markers (Ellegren, 2004). The Y- chromosome is not involved in meiotic recombination, but this does not seem to influence the mutation process of STRs when comparing it to autosomal chromosomes, that are involved in meiotic recombination. Hence, STRs mutations are likely not a consequence of recombination (Ellegren, 2004).

1.4.3 Consequences of STR length variability

In vertebrates, most STRs are found in intergenic regions of DNA. Mutations occurring in these regions are less affected by selection, and the frequency and distributions of them should reflect the underlying mutational processes of STRs (Ellegren, 2004).

(17)

STRs located in coding and promotor regions of genes are of great interest, because

mutational events can lead to a change in phenotype, and they also increase the evolvability of proteins. They are often highly polymorphic within species, which are assumed to affect the evolutionary variation in phenotypic traits (Vinces, Legendre, Caldara, Hagihara, &

Verstrepen, 2009). If a mutation occurs in an exon, it will change the expressed amino acid sequence, which can lead to changes in the function of the encoded protein. The consequence of the mutation depends on the type of STR that mutate, the length of the expansion, and the location of the mutational event in the open reading frame. Trinucleotides, hexanucleotides and nonanucleotides are units of STRs that will not alter the reading frame if they mutate. In exons, trinucleotide repeats are most common, and there are several examples where

mutations of such repeats lead to neuropathological disorders in humans (Ellegren 2004).

Huntington’s Disease is one of the most thoroughly studied tandem repeat disorders. The disease arises when a CAG trinucleotide repeat located in the huntingtin (HTT) gene expands to 36-180 number of base pairs, where 6-34 is the number of base pairs in a non-pathological gene. The CAG base triplet encodes for the amino acid glutamine, and the STR expansion leads to an increase in the length of a polyglutamine (polyQ) tract in the huntingtin protein.

Expansions of such polyQ-tracts, above 36 base pairs, leads to the degradation of neurons in the brain (Hannan, 2018). If a mutation of an STR that is not multiples of three occur, it would change the whole reading frame of the gene (Usdin, 2008). Depending on the gene, it might not be compatible with life of the individual. In an evolutionary perspective there is therefore a strong selection against such repeat units in amino acid coding regions (Ellegren, 2004).

On the other hand, if a repeat expansion or contraction occur in the regulatory regions of genes, it will not affect the function of the encoded protein directly. Instead, mutations in STRs in such locations can lead to change in the chromatin structure, or affecting

transcriptional regulation, translation, and splicing in different ways (Usdin, 2008). For example, if an STR located in a promotor region expands, it can affect the binding of transcription factors, and further affect the expression of the associated gene (Sawaya et al., 2013).

Even though we know that STRs play an important role when it comes to the expression of many phenotypic traits, a lot is still unknown about their function and evolutionary effects.

This field of research is in constant and rapid change, and after long-read high-throughput sequencing technologies became available, and more advanced bioinformatic tools that can

(18)

handle repetitive data continue to develop, a lot more is known today than was known just a few years back when repetitive sequences often was ignored. However, there are still limitations in the tools for sequencing and assembling of genic data when it comes to repetitive DNA or RNA sequences. One should therefore be particularly observant when working with repetitive genomic data. Parts of this study aims to investigate and highlight how these issues affect the alignment of the transcriptome to the genome assembly of Atlantic cod, and how the use of different tools can give different results.

(19)

2 Aims

This thesis aims to investigate how the transcriptome of Atlantic cod map to the different genome assembly versions of Atlantic cod and evaluate the quality of the genome-

transcriptome alignments. Furthermore, the hypothesis that STRs contributes to transcripts mapping partially to the genome assembly of Atlantic cod will be investigated with different approaches. Based on this, I will consider if type, amount, and location of STRs affects the mapping of a transcript to the genome assembly. Moreover, I will examine if there are common denominators in biological function among transcripts that are partially aligned.

I will use two different mapping tools in order to investigate which generates the best

alignment, and if one of them are better at aligning genomic data highly enriched with STRs.

Finally, I will look closer on some transcripts that do map to the genome assembly of Atlantic cod. Trough analysis of their annotation and by searching for sequence similarity, I aim to investigate if they are true transcripts of the transcriptome of Atlantic cod or not, and if some gene annotations could be missing from the genome assembly.

(20)

3 Material and Methods

3.1 Setup

The high-performance computing (HPC) facility, Abel, was used as platform for data storage and running analysis. During the period of this project, the platform was migrated to a new HPC cluster – Saga.

The command line was mainly used to run analyses and manage data. Python was used as programming language in some scripts, and Conda was primarily used as package manager when installing different tools for analysis of the datasets.

3.2 The genomic assembly datasets of Atlantic cod

The datasets that were used in this research were provided by CEES at the department of Bioscience at University of Oslo. The main dataset was the transcriptome assembly of Atlantic cod, which is generated from the PacBio Iso-Seq sequencing platform. The

transcriptome was isolated and extracted from several tissues, and at different developmental stages of a NEAC individual (Torresen et al., 2017). The transcriptome consists of 34 652 transcripts.

The transcriptome of Atlantic cod was mapped to the three different versions of the Atlantic cod genome assembly. The three genome assemblies of Atlantic cod used was generated from read datasets from the same cod individual, belonging to the NEAC population (Torresen et al., 2017). The individual was estimated to be 8 year of age based on otolith rings (Star et al., 2011). GadMor1 is the first genome assembly of Atlantic cod and is generated from 454- sequecing. GadMor2 is the second and improved version of the Atlantic cod genome assembly, and is generated from PacBio sequencing, Illumina and reads from the first

assembly. GadMor3 is the third and even more improved Atlantic cod genome assembly and is generated from an improved version of the PacBio sequencing platform, in addition to the supplementary technologies; 10XGenomics, Arima HiC and Bionano optical maps.

3.3 Annotation of the transcriptome

The transcriptome assembly was annotated using the de novo annotation tool, dammit version 1.2, which is a transcriptome annotator tool. Dammit searched for annotations in the

following databases; Pfam-A, Rfam, OrthoDB, BUSCO databases (vertebrata) and Uniref90.

(21)

HMMR, LAST and TransDecoder.³ 33 224 out of 34 652 transcripts in the transcriptome of Atlantic cod were annotated (see Appendix 6.2 for script that was used).

It is worth noting that dammit did not assign the same transcript ID name to the annotations in the output file as the transcript ID name in the original fasta file. Whenever the software could not annotate a transcript, it skipped this transcript in the output-file, but it did not skip one numerical value when an ID name was assigned to the next annotated transcript. To solve this problem dammit generated a name-map.

3.4 Mapping of the transcriptome to the genome assemblies of Atlantic cod

The transcriptome assembly of Atlantic cod was mapped to the three genome assemblies of Atlantic cod (gadMor1, gadMor2 and gadMor3) to investigate how the transcriptome aligned to each assembly. Two different mapping tools were used, the genomic mapping and

alignment program (GMAP) (Wu & Watanabe, 2005) version 3 and minimap2 (Li, 2018) version 2.17. The difference in the alignment when aligned with to different mapping tools were investigated in order to evaluate the mapping tools.

GMAP became available in 2005 and is a mapping tool suited for aligning cDNA derived from mRNA to genomic segments. It is able to align cDNA at a higher speed and accuracy than other tools providing the same operations, at the time the program was made available (Wu & Watanabe, 2005). In 2018, the minimap2 mapping tool became available. This tool was developed because of the fast development in sequencing technology which led to a necessity to map long reads of mRNA to a reference genome. The program is especially recommended for aligning PacBio reads. According to the developers of the program, a comparison of minimap2 to other mapping tools, including GMAP, outperformed these in accuracy and quality of the analysis (Li, 2018). The mapping tools were chosen because they are tools that preforms spliced read alignments, which is necessary when aligning

transcriptomes to a reference genome, and because they are the currently most recommended transcriptomic alignment tools.

3.4.1 Mapping procedures

In the beginning of this project, a different practice with another set of options were chosen when the transcriptome of Atlantic cod was mapped to the genome assemblies of Atlantic

3 See http://dib-lab.github.io/dammit/database-usage/ for information about the databases and how to run dammit

(22)

cod. These options were changed midways in the project because a more recommended practice for aligning PacBio reads to a reference genome came to my awareness. The options that were applied with each mapping tool in the first and second mapping practices, and the differences between the practices are described below.

With minimap2 the options used were -ax splice -uf -C5. These commands specified the output to be a SAM format in a CIGAR string;⁴ ran the alignment in a mode recommended for aligning long-read spliced alignments; found canonical splicing sites GT-AG on the transcript strand; and introduced the cost 5 for a non-canonical splicing (see Appendix 6.2 for the script that was used).⁵

When running GMAP, the command gmap_build built the database of the genome assembly from a fasta file. The options used together with this command were -d genome -D . -k 13, which built the database named “genome” and used a k-mer size of 13 in the database. For running the alignment, the command gmap was used together with the options -t 5 -n 0 -D . -d genome -f 3 -B 5. The two first options (-t 5 -n 0) specified the number of threads to be 5, and the number of paths to be 0, and were options that optimized the running time of the program.

The -d command specified mapping of cDNAs to a fasta file, -f 3 specified the output to be GFF3 format, and -B 5 was an option that specified specific computational settings (see Appendix 6.2 for script that was used).⁶

In addition to the options described for GMAP above, different values for --min-trimmed- coverage were set in order to regulate the minimum coverage of transcripts that mapped to the alignment. The same option was not available in minimap2. Minimum coverage of transcripts is defined as the minimum percentage of the total transcript that have to align continuously to the genome in order for it to be reported as a mapped transcript. The regulation of minimum coverage was done in order to investigate how this affected the number of transcripts that mapped to the genome assembly. The option --min-identity was held constant at 0.85 in every alignment. Minimum identity is defined as the minimum percentage of the nucleotides in the transcript sequence that have to match to the genome sequence for it to be reported as a mapped transcript.

(23)

As mentioned, a more recommended practice for aligning PacBio reads to a reference genome was taken in use later in the project. Only alignments generated from mapping when this practice was followed were further analyzed in the project. The practices for the alignment procedure is summarized in cDNA_Cupcake provided by Magdoll (E. Tseng, n.d.), where minimap2 is considered the most recommended mapping tool, and GMAP is the second recommendation. The alignment was therefore preformed with the same mapping tools as were used in the first practice.

The options used when the transcriptome was aligned the second time with minimap2 was -t 16 -ax splice -uf -secondary=no -C5 -O6,24 -B4. The difference from the previously used options was; that the output only reported the best alignments; a specification used for mapping long assemblies to a reference with up to 20% divergence in the sequences; and, a mismatch penalty of 4 (see Appendix 6.2 for script).

For GMAP, the options used the second time of aligning the transcriptome were -t 16 -n 0 -D . -d genome -f samse -B 5 -cross-species –max-intronlength-ends 200000 -z sense_force. The differences from the first mapping practice were the specification of output in SAM format;

an option that specified the program to use a more sensitive search for canonical splicing; an option that specified the first or last intron to be maximum 200 000 base pairs and; an option that specified that only the sense strand of cDNA was mapped (see Appendix 6.2 for script).

In addition to these changes in options, the most important step that optimized the alignment, was the procedure where redundant isoforms were collapsed in order to obtain a final set of unique, full-length, high-quality isoforms. The script collapse_isoforms_by_sam.py was used for this, which was retrieved from GitHub: https://github.com/Magdoll/cDNA_Cupcake. With this script transcripts that were nearly identical, and mapped at the same location in the

genome, were considered to be the same transcript. When using this script, it was also possible to regulate the cut-off values for minimum coverage in the alignments with both GMAP and minimap2.⁷ A series of alignments with increasing cut-off values for minimum coverage were therefore also performed with this mapping practice, in order to investigate how it affected the number of transcripts that mapped and did not map. Table 1 shows the cut- off values for minimum coverage and identity that were used.

7 See https://github.com/Magdoll/cDNA_Cupcake/wiki/Cupcake:-supporting-scripts-for-Iso-Seq-after- clustering-step for description of how to regulate cut-off values for minimum coverage.

(24)

Table 1: Cut-off values for minimum coverage and identity that were used when the transcriptome was mapped to the genome assembly of Atlantic cod.

Minium coverage (%)

30 40 50 60 70 80 90 99

Minimum identity (%)

85 85 85 85 85 85 85 85

With the recommended practices, the transcriptome of Atlantic cod was only mapped to the gadMor3 genome assembly, and not to the previous genome assembly versions.

3.4.2 Extraction of transcripts and fasta sequences

In order to further analyze the alignment and the discordance between the transcriptome and the genome assembly, transcripts that mapped and transcripts that did not map in the different alignments had to be directed into separate files. A file containing the transcript IDs of all transcripts in the transcriptome was therefore generated. This file was compared to a file containing the transcript IDs of all transcripts that mapped to the genome assembly.

Transcripts IDs that were not present in both files were transcripts that did not align to the genome assembly (see Appendix 6.2 for script).

The of minimap2 (without the use of Magdoll’s mapping practices) was in SAM format.

Hence, the method used for extraction of unmapped transcripts was done differently. The samtools software was used to access the SAM file. See Appendix 6.2 for script of how the file of unmapped transcript IDs was generated.

An aim of this project was to analyze transcripts that aligned partially to the genome of Atlantic cod. Hence, transcripts that mapped with more than 30%, but less than 90% of the sequence were extracted (see Appendix 6.2 for script that was used). The extraction was done in order to analyze transcripts that partially mapped to gadMor3 and exclude transcripts that mapped almost completely (more than 90%) and transcripts that did not map at all or had a very low coverage (less than 30%). The fasta sequences of these transcripts were extracted with the tool seqtk (see Appendix 6.2 for script that was used).⁸

3.5 Further analysis of partially aligned transcripts

The transcripts that partially aligned to gadMor3 were analyzed with a selection of tools in order to investigate the STR density and enrichment of specific biological processes

(25)

associated with these transcripts, and to detect differences in results when using different mapping tools. The different tools and how they were used are described below.

3.5.1 Identification of short tandem repeats (STRs) with Phobos and Sat-stat

STRs in the Atlantic cod transcriptome were detected using Phobos version 3.3.12. This is a highly accurate tandem repeat search tool that is able to identify perfect and imperfect TRs in a unit size range from 1 bp to > 5000 bp (Mayer et al., 2010). In this project, STRs with a unit size of 1-10 bp were detected with Phobos. The options used were -s 12 –outputFormat o -U 10. These options specified that a minimum score above 12 was required for each TR; that the output-file was in Phobos native format, and that the detected STRs were in a range of 1-10 bp (see Appendix 6.2 for script).

STR characteristics, such as the density of STRs in the genomic data of interest, were

computed using the program Sat-stat version 1.3.12, which is developed by Christoph Mayer (Mayer et al., 2010). A config file was provided (see Appendix 6.4), which gave an output file with different statistics of the STRs in the dataset and a GFF file. It is worth to note that in the file summarizing statistics of specific types of repeats, Phobos reports complimentary types of STRs together and lists them as only the first STR in alphabetical order. For example, when STR density of the AC di-repeat was shown, it also included the TG di-repeat, and AGC also included TCG.

3.5.2 Statistical testing in R

It was tested statistically if there was more bp in STRs in transcripts that partially mapped to the genome assembly, than in transcripts that aligned with more than 99% of the sequence.

This was done with a Welch t-test, which is a form of t-test that one uses when the two groups compared have unequal sample sizes, and one does not assume that the variance in the two groups are the same. Compared to a regular t-test, it is less restrictive. The alternative hypothesis was that the mean of base pairs in STRs in transcripts that partially mapped to gadMor3 was higher than in transcripts that mapped with more than 99% of the sequence. The null hypothesis was that the mean number of base pairs in STRs in the two groups were equal.

See Appendix 6.5 for how the t-test was performed in R.

In order to perform this test, the number of bp in STRs in each transcript was calculated from statistical information obtained from Sat-stat. One file for all transcripts that mapped with more than 99% of the sequence, and one for partially mapped transcripts were generated.

(26)

These files contained three columns, first with transcript ID, second with length of transcript and third with number of repeated base pairs per mega base pair (rbp/Mbp) (obtained from Phobos output file) (see Appendix for script that was used to generate the files). The files were transferred from Saga to local directory with rsync and read into R. From the

information in these files, the number of base pairs in STRs in each transcript was calculated with built in R-commands. An example of how number of base pairs in STRs in transcript/33 was calculated is shown below. The transcript was 6019 bp long and had 2492.11 rbp/Mbp.

𝑟𝑏𝑝

𝑀𝑏𝑝= 2492.11 𝑟𝑏𝑝

1 000 000 𝑏𝑝= 0.0024922 𝑟𝑏𝑝/𝑏𝑝

0.0024922 𝑟𝑏𝑝

𝑏𝑝 × 6019 𝑏𝑝 = 15 𝑟𝑏𝑝

3.5.3 InterProScan of the transcriptome assembly

InterPro is a recourse that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites (Mitchell et al., 2019). InterProScan, provided by InterPro, is a software package that allows protein- and nucleic sequences to be scanned against predictive models, provided by different databases.

When running an InterProScan, one is provided an output file that assigns gene ontology terms (GO terms) to the sequences of the insert file. These GO terms were analyzed in order to investigate if there were common denominators in biological functions of the partially mapped transcripts. The version InterProScan-5.27-66.0 was used in this project.

The insert file was the transcript sequences of the Atlantic cod transcriptome translated into peptide sequences, which was generated by transdecoder when annotating the transcriptome with dammit annotator tool. Transdecoder inserted the character “*” on the end of each translated peptide sequence, this character was removed in order to run the analysis on the peptide sequences (see Appendix 6.2 for script). A python script was made to access the transcripts and its associated GO terms in a separate file (see Appendix 6.3 for script).

Additionally, since translated protein sequences generated by dammit was used as input file, the wrong transcript IDs assigned by dammit had to be modified so they matched the original transcript IDs in the fasta file (see Appendix 6.2 for script).

(27)

3.5.4 Gene Ontology Enrichment analysis with Goatools

Goatools vv1.0.6 (Klopfenstein et al., 2018) perform gene ontology enrichment analysis that tests for overrepresentation or underrepresentation of gene ontology (GO) terms in a list of genes or gene products, in order to understand their biological significance. The tool is based on Fisher’s exact test. GO terms can describe the molecular function, cellular localization or the biological processes that the protein encoded by the gene is involved in.

Goatools was used to look for overrepresentation of GO terms among the transcripts that partially mapped gadMor3. Enrichment analysis of transcripts that were partially aligned when GMAP was used as mapping tool, and transcripts that were partially aligned when minimap2 was used as mapping tool was performed. In this project, a false discovery rate with the Benjamini and Yekutieli method was used to correct for multiple testing, and 0.05 was set as p-value (see Appendix 6.2 for script that was used).

3.6 Analysis of how STRs were aligned by the mapping tools

It was also analyzed directly if and how STRs were aligned differently by the two mapping tools. The tools and how they were used in these analyses are described below.

3.6.1 Bedtools

Bedtools version 2.29.2 consists of a wide range of tools that enables researchers to do analyses on genomic data. With the tools, researchers can intersect, merge, count,

complement, and shuffle genomic intervals from multiple files at a time. The tools support several widely used genomic file formats such as GFF/GTF, BED, BAM and VCF (Aaron R.

Quinlan, 2020).

The tool that was used to compare genomic intervals in this project was bedtools intersect. It was used to find STRs in gadMor3 that overlapped with the alignment performed by one of the mapping tools, and not the other. It was further investigated if there were any considerable differences in number of aligned STRs in the same alignment performed by different mapping tools, GMAP and minimap2.

Described in more details, all STRs annotated in gadMor3 that overlapped with the alignment between the transcriptome and gadMor3 when GMAP was used as alignment tool, and 99%

coverage and 85% identity were set as cut-off values, were identified with bedtools intersect.

Those STRs that overlapped with the GMAP alignment, and not overlapped with the

alignment when minimap2 was used as mapping tool were found with a second operation of

(28)

bedtools intersect. Then, the reverse was done. All STRs in gadMor3 that overlapped with the alignment when minimap2 was used as mapping tool, with minimum 99% coverage and minimum 85% identity as cut-off values, were identified. Followed by identification of the STRs in gadMor3 that overlapped with the minimap2 alignment, and not overlapped with the GMAP alignment. The file format of the tracks was GFF files. The file with coordinates of STRs was set as first track, and the files with alignments were set as second track. The option -wa was set in order to write the original entry of the first track to the output file (see

Appendix 6.2 for script that was used).

Prior to this, the GFF files of the alignments were modified because they contained both aligned exons of a transcript and the alignment of the total transcript. The GFF files were modified to only contain coordinates of aligned exons, in order to obtain a correct

representation of the number of STRs in the alignment (see Appendix for script).

3.6.2 The Genomic HyperBrowser

The Genomic HyperBrowser powered by Galaxy is a service provided by Elixir Norway. It is a system that provides a range of tools to analyze genomic tracks, and compute statistics from genomic data of interest (Afgan et al., 2018). In this project, the HyperBrowser version 2.0b7 was used to determine the relative position of STRs in transcripts that contained STRs only aligned by one of the mapping tools, either with GMAP or minimap2, and not the other. The purpose of this analysis was to investigate whether the position of an STR in a transcript affected if the STR was aligned by one of mapping tools or not, which could contribute to explain differences in resulting alignments by the mapping tools.

In order to perform this analysis, a GFF file covering STRs in gadMor3 that overlapped the alignment with GMAP and not minimap2 (generated with bedtools intersect), and a GFF file covering the transcriptome-genome alignment with a cut-off value at minimum 99% coverage with GMAP, were uploaded to the HyperBrowser and used as first and second track,

respectively. The gadMor3 genome assembly, already available in the HyperBrowser, was chosen as genome build. The alternative descriptive statistics was chosen, followed by the alternative relative position within segment. If the analysis showed that STRs only aligned by GMAP were located in a specific location in the transcripts, it could tell that minimap2 had difficulties with aligning STRs in this position. The same procedure was followed when the

(29)

3.6.3 Integrative Genomic viewer (IGV)

IGV is a visualization tool that enables exploration of genomic datasets (Robinson et al., 2011). In this project it was used to compare different tracks of genomic data from the Atlantic cod transcriptome and visualize how they were different.

As an example, the combination of tracks used was; one track of the alignment where minimum 30% of the transcripts covered the genome and another track of the alignment where minimum 90% of the transcripts covered the genome, both when aligned with GMAP and minimap2, and a track with annotated STRs in gadMor3. These tracks were added to the gadMor3 genome sequence. With this, it was visualized if STRs were located in unaligned regions of transcripts that partially aligned to the genome of Atlantic cod, and if there were differences in partially aligned transcript when aligned with GMAP and miniamp2.

3.7 Analysis of transcript that did not map at all to gadMor3

The transcripts that did not map to gadMor3 when minimap2 were used as mapping tool in the first practices was considered a good representation of transcripts that did not map at all to gadMor3. A random selection of these were further analyzed. In order to evaluate if they could be contaminants, or if they were transcripts that originate from the true transcriptome of Atlantic cod, their sequence similarity were searched for in the non-redundant database of protein sequences with BLASTx (Altschul, Gish, Miller, Myers, & Lipman, 1990). It was searched in all organisms available in this database. If they showed high sequence similarity to genes in simple organisms, as for example species of zooplankton, they were considered as contamination in the sequenced sample. If the transcripts showed sequence similarity to genes in other species of fish, the search in the database was limited to two closely related

organisms, Atlantic herring and Zebrafish, and also to Atlantic cod. If the transcript sequence had sequence similarity to a gene in these species (and not to Atlantic cod), the location in the genome and the genes annotated upstream and downstream of the gene that the transcript showed sequence similarity to, were investigated. It was looked for surrounding regions that were conserved across these species (regions of synteny) and evaluated if the gene the transcript sequence was similar to could be missing from gadMor3, or not annotated, in this region.

(30)

The genome browsers in NCBI of gadMor3 of Atlantic cod and the two related species, Atlantic herring and Zebrafish, was used for investigation of the annotation of these unaligned transcripts in Atlantic cod.⁹

9gadMor3 genome browser: https://www.ncbi.nlm.nih.gov/genome/gdv/browser/genome/?id=GCF_902167405.1 Zebrafish genome browser: https://www.ncbi.nlm.nih.gov/genome/gdv/browser/genome/?id=GCF_000002035.6

(31)

4 Results

4.1 Mapping

The transcriptome of Atlantic cod was mapped against the different Atlantic cod genome assemblies; gadMor1, gadMor2 and gadMor3, with the use of two different mapping tools;

GMAP and minimap2. Table 2 shows the number of transcripts that did not map at all when minimap2 was used as mapping tool, many of the transcripts are partial alignments. From this table one can see that the highest fraction of transcripts did not map to gadMor3, and most transcripts mapped to gadMor1, which was the first and most fragmented genome assembly.

Table 2: The table shows the number of transcripts that did not map to the different assembly versions of Atlantic cod when minimap2 was used as mapping tool. The percentage of non-mapping transcripts in the transcriptome are shown in parenthesis.

Genome assembly Non-mapping

gadMor3 167 (0.48%)

gadMor2 120 (0.35%)

gadMor1 48 (0.14%)

Table 3 shows the number of transcripts that did not map when GMAP was used as mapping tool and different cut-off values for minimum coverage of the transcripts in the alignment were regulated. The cut-off value for minimum identity between the transcript and genome sequence was 85% for every alignment. The results show that the highest fraction of transcripts did not map to gadMor1, and the lowest fraction of transcripts did not map to gadMor3. The same trend was seen in all alignments with different cut-off values for minimum coverage. The difference between Table 2 and Table 3 is that in Table 2, partial alignments are counted as mapping, while in Table 3 the alignments have concrete cut-offs for minimum coverage. In Table 2 more transcripts align to gadMor1, than to gadMor2 and gadMor3, while in Table 3 more transcripts align to gadMor3 than the other genome assemblies. One reason for this could be that gadMor1 might contain more of the complete genic content than the other to assembly versions, so that most transcripts have at least some regions that align to the genome, but since the genome assembly is fragmented, they align at less than 50% coverage.

(32)

Table 3: Number of non-mapped transcripts in each genome assembly when GMAP was used as mapping tool, and when different cut-off values for minimum coverage were set. The percentage of how many transcripts that did not map in the total transcriptome are shown in parenthesis.

Minimum coverage (%)

gadMor3 gadMor2 gadMor1

50 527 (1.5%) 646 (1.8%) 1820 (5.2%)

60 657 (1.8%) 853 (2.5%) 2750 (7.9%)

70 812 (2.3%) 1 121 (3.2%) 3 563 (10.3%)

80 1 042 (3.0%) 1 477 (4.3%) 4 468 (12.9%)

90 1 451 (4.2%) 1 948 (5.6%) 5 553 (16.0%)

100 28 775 (83.0%) 28 922 (83.5%) 29 618 (85.5%)

As mentioned in Material and Methods, I became aware of a recommended procedure for aligning Iso-Seq data to a reference genome midway in the project. The following results from mapping of the transcriptome to the genome assembly are done according to this

procedure. Notably, the results in Table 3 show that the transcriptome aligns more transcripts to gadMor3 than the other two assembly versions. Hence, the transcriptome was only mapped to gadMor3 with the new mapping procedure, and further analysis was done with this setup.

The number of transcripts that did not map when the transcriptome was mapped with GMAP and minimap2, and different cut-off values for minimum coverage were used are shown in Table 4. The cut-off value for minimum identity was held constant at 85% in every alignment.

The table show that minimap2 mapped most transcripts to gadMor3 when a cut-off value for minimum coverage at 80% or above was used, and the number of transcripts GMAP mapped were slightly lower with the same parameters. However, when lower cut-off values for minimum coverage (70% or below) was applied, GMAP mapped more transcripts than minimap2.

(33)

Table 4: The table shows the number of transcripts that did not map at different cut-off values for minimum coverage, when mapped with GMAP and minimap2. The percentage of non-mapped transcripts in the transcriptome are shown in parenthesis.

Minimum coverage (%) GMAP minimap2

30 408 (1.2%) 693 (2.0%)

40 461 (1.3%) 721 (2.1%)

50 532 (1.5%) 763 (2.2%)

60 709 (2.0%) 858 (2.5%)

70 890 (2.6%) 978 (2.8%)

80 1 135 (3.3%) 1 058 (3.1%)

90 1 552 (4.5%) 1 225 (3.5%)

99 3 406 (9.8%) 2 941 (8.5%)

Transcripts that aligned with more than 30%, but less than 90% of the sequence to gadMor3 were extracted. The intersection between these two groups of transcripts was assumed to be a good representation of the transcripts that had parts of the sequence mapping to the genome, and they are therefore referred to as the partially mapped transcripts. The number of partially mapped transcripts are represented in Table 5.

Table 5: The table shows the number of transcripts that partially mapped to gadMor3, and the difference in number of partially mapped transcripts between the two mapping tools, GMAP and minimap2. The percentage of transcripts that partially mapped in the transcriptome are shown in parenthesis.

Genome assembly GMAP minimap2

gadMor3 974 (2.8%) 628 (1.8%)

4.2 Identification of STRs in partially mapped transcripts with Phobos

The overall STR density in transcripts that mapped to gadMor3 with more than 30% of the sequence, but less than 90%, both when mapped with GMAP and minimap2, was compared to the STR density in transcripts that mapped with more than 99% of the sequence to gadMor3 in Figure 3.

(34)

Figure 3: The figure shows the total density of STRs in transcripts that mapped with more than 99% of the sequence (denoted “Mapping”), and in transcripts that mapped with more than 30%, but less than 90% (denoted

“Partially mapping”), both when mapped with GMAP and minimap2.

The results in Figure 3 show that there was a considerably higher density of STRs in transcripts that partially mapped to gadMor3, compared to transcripts that had a nearly complete alignment to gadMor3. This applied to both mapping tools, minimap2 and GMAP.

Further, transcripts that were partially aligned with GMAP had a slightly higher density of STRs, than transcripts partially aligned with minimap2.

Figure 4 shows the difference in density of different unit-sizes of STRs in transcripts that mapped with more than 30% but less than 90% of the sequence, when mapped with GMAP and minimap2, and are compared to the transcripts that mapped more than 99% of the sequence with each mapping tool.

0 5000 10000 15000 20000 25000 30000

Mapping,

minimap2 Partially mapping,

minimap2 Mapping, GMAP Partially mapping, GMAP

bp/Mbp

(35)

Figure 4: The figure shows the density of each unit-size (denoted as numbers from 1-10 on the horizontal axis) of STR in transcripts that had a nearly complete alignment and in transcripts that mapped partially to gadMor3, both when GMAP and minimap2 were used. Blue columns represent transcripts that mapped to gadMor3 with GMAP, grey columns represent transcripts that mapped to gadMor3 with minimap2, orange columns represent transcripts that partially mapped to gadMor3 with GMAP and yellow column represent transcripts that partially mapped to gadMor3 with minimap2.

The results in Figure 4 show that di-repeats were especially abundant among partially mapped transcripts mapped with GMAP, and tri-repeats were also slightly more enriched among partially aligned transcripts mapped with GMAP, compared to transcripts that had a nearly complete alignment to gadMor3. Further, Figure 4 show that partially aligned transcripts mapped with minimap2 were more enriched in tetra-repeats and di-repeats compared to transcripts that had a nearly complete alignment to gadMor3. Minimap2 and GMAP partially aligned different unit sizes of STRs. When minimap2 was used as mapping tool tetra-repeats were more abundant, while di-repeats were more abundant when GMAP was used as mapping tool.

A Welch t-test was used to investigate whether there was a significant difference in the mean of number of base pairs in STRs in transcripts that partially mapped and in transcripts that had a nearly complete alignment to gadMor3. The mean of number of base pairs in STRs in the two respectively groups and the resulting p-values from the t-tests are represented in Table 6.

0 2000 4000 6000 8000 10000 12000 14000 16000 18000

1 2 3 4 5 6 7 8 9 10

bp/Mbp

Mapping; GMAP Partly mapping; GMAP Mapping; minimap2 Partly mapping; minimap2

(36)

Table 6: The table shows the mean number of bp in STRs in partially mapped transcripts, and in transcripts that mapped with more than 99% of the sequence. In addition to the computed p-value that was generated from the t- test that tested if the mean of bp in STRs in the two groups was equal or not. This was done for partially aligned transcripts from both mapping tools.

Mapping tool Mean bp in STRs in partially mapped transcripts

Mean bp in STRs in mapped transcripts

P-value

GMAP 57.3 22.5 < 2.2 x 10^-16

Minimap2 45.5 24.2 < 2.2 x 10^-16

The tests confirm that there was a significant difference in the mean number of base pairs in STRs in transcripts that partially mapped and in transcripts that mapped with more than 99%

of sequence to gadMor3, both when mapped with GMAP and minimap2.

Table 7 shows the density of types of STRs occurring most frequently in transcripts that partially mapped to gadMor3. These are compared to the density of the same type of STR in the total transcriptome of Atlantic cod.

Table 7: The table shows the density of the most common types of STRs in the transcripts that mapped with more than 30%, but less than 90% of the sequence to gadMor3, from both of the mapping tools. They are compared to the density of the same types of STRs in the total transcriptome of Atlantic cod.

Repeat type Minimap2 (bp/Mbp) GMAP (bp/Mbp) Total transcriptome (bp/Mbp)

AC/TG 9 270 13 682 4 108

AG/TC 787 2 445 888

AGC/TCG 147 2 018 391

ACCT/TGGA 5 556 348 95

The di-repeats AC/TG and AG/TC were most abundant in partially mapped transcripts, especially when GMAP was used as mapping tool. AGC/TCG tri-repeats were abundant in partially mapped transcripts mapped with GMAP, and ACCT/TGGA were abundant in partially mapped transcripts mapped with minimap2.

(37)

4.3 Enrichment of GO terms in partially mapped transcripts

22 251 of in total 33 225 transcripts in the transcriptome of Atlantic cod had GO terms connected to them. Goatools was used to check for enrichment of GO terms among partially mapped transcripts compared to GO terms inn all transcripts of the transcriptome of Atlantic cod.

Transcripts that partially mapped to gadMor3 with minimap2 as mapping tool, had in total 26 GO terms that were overrepresented. 12 of these were biological processes, 1 was a specific cellular component, and 12 were specific molecular functions. Furthermore, transcripts that partially mapped to gadMor3 with GMAP as mapping tool had enrichment of 30 GO terms.

18 of these were biological processes, 1 was a specific cellular component, and 11 were molecular functions. The enriched GO terms that were biological processes and had an FDR corrected p-value below 0.05 are shown in Table 8.

(38)

Table 8: The table shows the GO terms and name of biological processes that were enriched in the group of partially mapped transcripts. The p-values are listed to show significance, and the ratio of transcripts that were linked to given GO term in the group of partially mapped transcripts, and in the total transcriptome are also shown. 8a) lists GO terms that were enriched in partially mapped transcripts when GMAP was used, 8b) lists GO terms that were enriched in partially mapped transcripts when minimap2 was used.

Gene ontology ID

GO name P-value Ratio in

partially mapped transcripts

Ratio in transcriptome

8a) GO terms associated with partially aligned transcripts mapped with GMAP

GO:0042981 Apoptotic process 2.55e-09 24/881 146/33225

GO:0007186 G protein-coupled receptor signaling pathway

2.9e-06 33/881 395/33225

GO:0002376 Immune system process 9.85e-05 30/881 399/33225

GO:1903510 Mucopolysaccahride metabolic process

0.0183 4/881 13/33225

GO:0045944 Positive regulation of transcription by

RNA polymerase II 0.0211 3/881 6/33225

GO:0006022 Aminoglycan metabolic process 0.0311 9/881 88/33225 8b) GO terms associated with partially aligned transcripts mapped with minimap2

GO:0032456 Endocytic recycling 1.94e-06 8/532 25/33225

GO:0009410 Response to xenobiotic stimulus 0.000667 3/532 3/33225 GO:0038166 Angiotensin-activated signaling

pathway

0.00224 3/532 4/33225

GO:0008284 Positive regulation of cell population proliferation

0.00875 3/532 6/33225

GO:0016197 Endosomal transport 0.00917 8/532 88/33225

GO:0006813 Potassium ion transport 0.0202 8/532 99/33225

4.4 Analysis of difference in mapped STRs in the same alignment preformed with two different mapping tools

In order to look closer at the hypothesis that GMAP and minimap2 mapped differently because of different capacity of aligning STRs, the regions in each alignment, with minimum

(39)

Furthermore, the STRs in gadMor3 that overlapped with one of the alignments, but not the other, were identified. The number of such cases for each alignment are shown in Table 9.

Table 9: Second column shows number of STRs that overlapped with the alignment between the transcriptome of Atlantic cod and gadMor3 (either with GMAP or minimap2), third column shows number of STRs that did not overlap with the alignment when the other mapping tool was used.

Mapping tool Number of STRs in gadMor3 that overlap with alignment

Number of STRs in gadMor3 that do not overlap with second alignment

GMAP 10 435 291

minimap2 10 673 529

It was investigated with the HyperBrowser if the position of an STR located within a transcript affected the ability for the two different mapping tools to map the transcript. The overall average relative position of STRs in transcripts that had STRs only aligned by GMAP, was estimated to be 0.67, where 0 was the 5’ end of the transcript, and 1 was the 3’ end of the transcript. Furthermore, the average relative position of STRs in transcripts that had STRs only aligned by minimap2, was estimated to be 0.64. The average relative positions of STRs within transcripts that were only aligned by one of the mapping tools, either GMAP or minimap2 are shown in Figure 5. The transcripts were grouped into the chromosomes in gadMor3 they aligned to. For example, the average relative position of STRs in transcripts that had STRs only aligned by GMAP in the region of chromosome 2 in gadMor3 was 0.25, and it was 0.76 in STRs only aligned by minimap2 on this chromosome. The total picture of Figure 5 shows great variation in relative position of STRs in transcripts that had STRs only aligned by one of the alignment tools, and not the other.

The transcriptome of Gadus morhua and its discordance with the genome

The transcriptome of Gadus morhua and its discordance with the genome

How does enrichment of short tandem repeats influence the alignment?

Kari Jerve Ramsøy

Master of Science Thesis

Genetics and developmental biology 60 credits

Department of Biosciences

The Faculty of Mathematics and Natural Sciences UNIVERSITY OF OSLO

The transcriptome of Gadus morhua and its discordance with the genome

How does enrichment of short tandem repeats influence the

alignment?

Abstract

Acknowledgements

Table of contents

1 Introduction

2 Aims

3 Material and Methods

4 Results