Profiling the neoantigen potential of somatic cancer mutations
João Carlos Martins das Eiras
Master thesis at The Department of Informatics The Faculty of Mathematics and Natural Sciences
UNIVERSITETET I OSLO
July 2017
© João Carlos Martins das Eiras 2017
Profiling the neoantigen potential of somatic cancer mutations João Carlos Martins das Eiras
http://www.duo.uio.no/
Printed: Reprosentralen, Universitetet i Oslo
Abstract
It is well established in the field of immunology and oncology that tumours express antigens unique to the cancer, called neoantigens, that can be recognized by T-cells [1] which target the cancer cell for destruction. The expression of antigens, depends on several regular cell mechanisms like gene expression, the antigen processing pathway, and MHC binding. Cancer’s unique mutation profile contributes to the formation of mutated proteins which have the potential of producing neoantigens. By exploiting the neoantigen expression of cancer, immunotherapies can be developed, harnessing the biological processes of the adaptive immune system to attack the tumour.
Each tumour’s unique and unpredictable mutation profile poses a big challenge for the treatment of sequencing data coming from Next Generation Sequencing methods. With the set of somatic mutations, one can ask which is the set of potential neoantigens that the tumour expresses and with which frequency? The answer to this question has the potential of contributing significantly to the creation of immunotherapies, targeting both unique and shared neoantigens, shared among a population of cancer patients. The creation of shared neoantigen therapies have the extra benefit of being more cost effect and readily available than patient specific ones.
To answer the question, this project had the following goals: 1) integrate tools for DNA variant annotation that can detect peptide changes; 2) integrate MHC binding prediction tools; 3) use these tools with two gold standard databases of somatic variants, COSMIC and My Cancer Genome (MCG), and the list of cancer driver genes from the Cancer Gene Census (CGC), to study the frequency of neoantigens and detect patterns of neoantigens among known somatic cancer variants and known cancer driver genes.
The end result was a framework to identify and prioritize shared neoantigens in cancer, by implementing a variant annotation and MHC binding prediction pipeline, and a probability model to select common neoantigens in populations. Applied to COSMIC and CGC, this framework revealed interesting trends, like the high load of potential MHC binder peptides of cancer driver genes such as TP53 gene, and the relative high MHC binding potential of the SLC34A2 gene.
Applied with mutation frequency data from the MCG, this framework produced a visualization of how each MCG variant and gene affected the probability of shared neoantigens with a given HLA context, and a visualization of potential benefit of shared immunotherapies targeting shared neoantigens.
Preface
My first job in 2007 was at a software company that developed a big information system for hospitals, clinics, and primary care centres. Even though overworked and in a constant hectic environment there was great sense of drive. The product was used by tens of thousands of healthcare professionals, which in the end could deliver better care to their patients and even save lives. Meanwhile, I moved onto new projects and a whole new country.
In 2015, faced with lack of motivation at work and with yet another downsizing at my employer, I decided it was time to do a change in my career, from web technologies to something more meaningful. The consumer market was not appealing for me. Work, release, and wait for the product to become obsolete after 6 months. The choice became clearer and clearer. Take some time off from a 9 to 5 job, go back to school, learn something new and see what new opportunities arise.
During studies at the University of Oslo, I came into contact with Trevor Clancy through the regular channels for master students looking for their projects and destiny. A project related with genetics and cancer. Sounded fascinating and simultaneously felt uneasy. A completely new area of knowledge and expertise. But, if I succeeded, that could mean lives being saved. And a whole new range of doors opening.
And so began my relationship with OncoImmunity, which during the last year and a half has enabled me with a lot of new knowledge and practical experience of what bioinformatics can do for the field of oncology.
My goals and the goals of company were well aligned with the goals of the project and resonated well with me. To be at the forefront of cancer immunotherapy, or more specifically, work with state of the art tools and knowledge to help usher a new field of bioinformatics, benefiting from the economy of scale achieved quite recently from Next-Generation Sequencing of DNA.
The project, with its ups and lows, was largely successful. For me and my colleagues at OncoImmunity, in the end it raises more questions than it answers. And that’s a good thing.
Gaps of knowledge are what make discovery exciting.
None of this could have been achieved without the persistent oversight of my supervisor Trevor Clancy, and the helpful cooperation of my colleagues Richard Stratford, Harald Kirkerød, Irantzu Anzar, Jonathan Feinberg, Chris Rose, Marius Eidsaa and Angelina Sverchkova. To all of them, thank you. I am also grateful to my co-supervisor at the University of Oslo, Eivind Hovig.
Table of Contents
1. Introduction ... 1
1.1. About the project ... 1
1.2. Context ... 1
1.3. Challenges of comprehensive neoantigen discovery ... 2
1.4. Objectives ... 3
2. Biology background ... 4
2.1. Genetics chemistry ... 4
2.2. The immunogenic journey of peptides in a cell ... 6
2.3. Cancer as a genomic disease and the generation of neoantigens in tumours ... 7
3. Bioinformatics background ... 9
3.1. Next Generation Sequencing ... 9
3.1.1. Dye sequencing ... 11
3.1.2. Single Molecule Real Time Sequencing ... 12
3.1.3. Ion semiconductor sequencing ... 13
3.1.4. Nanopore sequencing ... 13
3.2. DNA read assembly and mapping ... 14
3.2.1. De novo sequencing... 14
3.2.2. Standardized reference genomes ... 15
3.2.3. DNA sequencing with a reference ... 16
3.3. Variant calling ... 17
3.4. Variant calling of somatic variants ... 21
3.5. Variant annotation ... 22
3.5.1. VEP – Variant Effect Predictor ... 25
3.5.2. ANNOVAR ... 25
3.5.3. SnpEff ... 26
3.5.4. Varcode ... 26
3.5.5. Variant annotation tool comparison... 26
3.6. HLA typing ... 28
3.7. HLA peptide binding ... 29
3.8. Main observations on the state of the art ... 30
4. Pipeline for neoantigen discovery ... 32
4.1. Outline ... 32
4.2. Data model ... 34
4.3. Intersection with project goals ... 35
5. Development of an integrated variant annotation pipeline ... 36
5.1. Objectives ... 36
5.2. Setting up the first variant annotation tool ... 36
5.3. Adding support for VEP – Variant Effect Predictor ... 38
5.4. A small artificial test suite ... 40
5.5. Integrating ANNOVAR ... 42
5.6. Discussion ... 46
6. Development of the MHC binding prediction pipeline. ... 49
6.1. Outline ... 49
6.2. Input format ... 49
6.3. MHC Binders ... 51
6.4. Workflow ... 52
6.5. Observations ... 53
7. MHC binding prediction of known cancer variants: COSMIC ... 54
7.1. A comprehensive catalogue of known cancer variants ... 54
7.2. Experiment setup ... 56
7.3. Variant annotations: VEP versus ANNOVAR ... 57
7.3.1. Data Set ... 57
7.3.2. Annotation Comparison ... 58
7.3.3. Conclusions ... 62
7.4. MHC Binding ... 63
7.5. Neoantigen potential of cancer driver genes ... 66
7.6. Conclusions ... 70
8. Profiling clinically relevant cancer variants: My Cancer Genome ... 71
8.1. Introduction ... 71
8.2. Data preparation ... 72
8.3. The MHC binding potential of the MCG mutations... 74
8.4. Data model ... 77
8.5. The MHC binding potential within a cancer’s population ... 78
8.6. The MHC binding potential within a regional population ... 83
8.7. Conclusions ... 88
9. Final remarks and conclusions ... 89
9.1. Technical goals and results ... 89
9.2. Research goals and results ... 90
10. References ... 93
11. Appendix ... 99
11.1. Mouse B16 and CT26 mutated protein query ... 99
11.2. Cancer Gene Census MHC binding potential ... 101
11.3. My Cancer Genome Mutations and Variants ... 104
11.4. HLA Allele Frequencies ... 123
11.5. Probability of a specific gene having mutations which produce MHC bound peptides, for a given cancer type. ... 124
Table of Figures
Figure 1 – DNA molecule... 4
Figure 2 – Intron and exons. ... 5
Figure 3 – Genetic code. ... 6
Figure 4 – MHC Class I antigen processing pathway (image courtesy of OncoImmunity). ... 7
Figure 5 – Read depth versus error rate. ... 10
Figure 6 – Overlap of variants produced by different variant callers [51, p. 2226]. ... 22
Figure 7 – Outline of a neoantigen discovery pipeline from NGS sequencing data. ... 32
Figure 8 – Antigen discovery pipeline database mock-up. ... 34
Figure 9 – Workflow of the variant annotation program. ... 47
Figure 10 – MHC binding prediction pipeline. ... 49
Figure 11 – Workflow for the COSMIC analysis. ... 56
Figure 12 – Overlap between annotated transcripts. ... 57
Figure 13 – Overlap of gene region annotations. ... 59
Figure 14 – Overlap of annotation consequences. ... 60
Figure 15 – Comparison of annotations per variant type. ... 61
Figure 16 – Distribution of ic50 values. ... 65
Figure 17 – Pairwise comparison of MHC binding tools... 66
Figure 18 – Number of MHC bound peptides per driver gene, top-30 genes. ... 68
Figure 19 – Ratio of MHC binder peptides per total peptide pairs, top-30 genes. ... 68
Figure 20 – Ratio of MHC binder peptides to mutations and gene length, top 30 genes. ... 69
Figure 21 – My Cancer Genome front page, 2017-05-27 ... 71
Figure 22 – Binder peptide counts per HLA allele and MHC tool... 75
Figure 23 - Number of MHC bound peptides per MCG gene. ... 76
Figure 24 – Probability of producing a MHC bound peptide per MCG gene. ... 76
Figure 25 – Data model for the MCG analysis. ... 77
Figure 26 – Probability percentage of a specific cancer producing MHC bound peptides. ... 81
Figure 27 – Probability percentage of a specific cancer producing MHC bound peptides as function of mutation load ... 82
Figure 28 - Probability percentage of a specific gene producing MHC bound peptides within a specific cancer. ... 83
Figure 29 – Probability percentage of the 20 best mutations most probable of producing a MHC bound peptide. ... 83
Figure 30 – MHC binding potential in Europe. ... 86
Figure 31 – MHC binding potential in North America. ... 87
Figure 32 – MHC binding potential in North East Asia. ... 87
Figure 33 – Probability of a specific cancer producing MHC bound peptides, split by cancer and gene... 124
1. Introduction
1.1. About the project
This project “Profiling the neoantigen potential of somatic cancer mutations” was conducted in 2016/2017 at OncoImmunity, organization number 913 062 612, located at Ullernchausséen 64, 0379 OSLO, by João Carlos Martins Eiras, in the context of the master's program Informatics: Technical and Scientific Applications from the Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo.
1.2. Context
In February 2016, some newspapers [2] reported the success of experimental trials using T- cell therapy to treat late stage acute lymphoblastic leukaemia (ALL) conducted at Fred Hutchinson Cancer Research Center , in Seattle, WA, USA. In the most successful trial, 33 out of 35 patients went into remission. Quoting one of the researchers “This is unprecedented in medicine, to be honest, to get response rates in this range in these very advanced patients”. In this form of therapy, T-cells are harvested from the patient, they are trained to recognize specific markers, unique peptides that are exposed on the surface of the cancer cells, called epitopes, and then injected back in the patient, expecting that these trained T-cells will then recognize the tumour and attack it as if it was a foreign pathogen being attacked by the immune system [3]. The T-cell therapy and anti-cancer vaccines (which train the immune system to recognize the cancer's unique markers), together with other common forms of treatment like radio and chemotherapy, hold the potential of seriously increasing the rates of cure and survivability of cancer patients.
These developments may bring about a potential revolution in the field of oncology, making personalized medicine more of a short-term reality, than an unreachable vision of the distant future.
The field of bioinformatics is experiencing huge growth due to sequencing costs becoming progressively cheaper, sequencing methods having more quality and producing more data, and time needed for sequencing also becoming faster. While in 2005 sequencing a human genome would cost approximately a couple million USD, nowadays (in 2016) it is reported that full human genome sequencing can be done in a couple days for something like 1000 USD [4][5].
These trends are expected to increase over time. Since the quality and amount of information is increasing and the costs are decreasing, this enables a whole different economy of scale whereby
using genetic data from patients for targeted treatments is possible and affordable and unleashing demand for better bioinformatics tools and methods.
The development of relevant bioinformatics solutions serves many actors: students, researchers, patient selection for clinical trials, physicians investigating the best course of treatment. It is therefore an area of research expected to become more and more relevant, and grow in interest, resources allocated and market opportunities.
1.3. Challenges of comprehensive neoantigen discovery
Cancer’s unique unpredictable mutation profile poses a big challenge for the current state of the art technology, despite DNA sequencing technology and bioinformatics methods becoming better over time. The current methods, that rely on DNA chunking and amplification, when trying to reassemble the long strands of DNA, have to deal with high uncertainty rates while trying to detect cancer mutations with low frequency at the same time they need to filter out errors and noise in the data. The end result is not a 100% accurate snapshot of the cancer mutation profile.
Out of the many cancer mutations, some create amino-acid changes in proteins. Once proteins have served their biochemical purpose, or become damaged, they are broken down by the cells protein recycling machinery in 8 to 12 amino-acid peptides. These peptides then might be transported by a TAP molecule and bound to a MHC molecule, which exposes them on the membrane of the cell, potentially being recognized by a T-cell, targeting the cell for destruction [1]. This binding process involves complex pathways (antigen processing and the major histocompatibility complex) which behaviour is not entirely understood, but many tools exist that model that behaviour using different machine learning and statistical methods. Whether a T-cell can recognize a mutated peptide is also dependent on its abundance in the cell surface, its stability (how long it lasts on the cell surface before being further broken down) and the difference between the mutated and germline peptides, the latter which are tolerated in healthy individuals.
The mutated peptides that are recognized by the T-cells are termed “neoantigens” and there is increasing data showing that they play a role in determining immune recognition and the regulation of cancer, therefore being potential biomarkers for cancer immunotherapy [6].
The process of detecting neoantigens from DNA sequencing data has lots of uncertainty as will be discussed further. Thus, it is desirable to expand the scope the experiments, allowing trends and common markers to be found across multiple experiments, building up the statistical
significance of the findings, and intersecting individual patients’ data with large population studies.
The detection of common neoantigens across populations also has the benefit of supporting the creation shared immunotherapies that target a subset of the patient population, available off the shelf, trialled and tested, and more cost effective than creating a new personalized therapies from scratch that target just one individual [7].
1.4. Objectives
To achieve the goal of “Profiling the neoantigen potential of somatic cancer mutations”, the main goals of this project are to:
• To create an improved variant annotation framework which can be applied to neoantigen discovery.
• Integrate MHC binding predictors.
• Profile/describe the neoantigen potential of known cancer mutations, specifically from COSMIC and My Cancer Genome
• Predict probability of shared neoantigens, for any gene or variant, across cancer types and populations.
2. Biology background
2.1. Genetics chemistry
Cells of living organisms contain genetic information which describes instructions necessary for the cell’s biological functions. In multicellular organisms, this information is contained in the cell’s nucleus, although some types of cells have neither nucleus nor genetic information. The genetic information is encoded into DNA (deoxyribonucleic acid).
DNA’s structure is a double helix, where each strand is composed by nucleotides, basic components that connect to each other to form long chains [8].
DNA strands have direction, from the 5 prime end - abbreviated as 5’, also referred as upstream - to the 3 prime end - abbreviated as 3’, also referred as downstream. This naming reflects the chemical structure of the phosphate groups which are part of the nucleotides, and which link to form the backbone of the DNA strand. DNA direction is central to many processes that involve DNA molecules as traversal of the DNA strands occur in the 5’ to 3’
direction.
The nucleotides are called adenine (A), cytosine (C), guanine (G) and thymine (T). Under normal conditions, some of these instructions will be interpreted to produce messenger RNA, which is then translated to proteins, proteins that have important roles the chemicals processes that happen inside the cell.
RNA uses the same base nucleotides, with the exception of thymine (T) being replaced with uracil (U). RNA (ribonucleic acid) is a single strand molecule, of smaller length. It participates in many regulatory and protein building cell processes. But it is less stable which is undesirable for long term information storage.
The process of reading DNA into RNA is called transcription. The regions of DNA that are transcribed to RNA are called genes. Genes are sub-divided into alternating exons and introns
1 Zephyris - CC BY-SA 3.0 https://commons.wikimedia.org/w/index.php?curid=15027555
Figure 1 – DNA molecule1.
(see Figure 2). After a RNA molecule is transcribed from DNA, the introns are spliced out by the spliceosomes, and the exons joined together. Introns are identified by specific nucleotides sequences which promote the splicing.
Figure 2 – Intron and exons2.
The splicing process itself is not deterministic. Whole exons can be kept, or removed together with introns, some exons cam be extended into the intron, or some introns kept. This process of splicing a gene in different ways is called “alternative splicing”. The different splice variants of the same gene are called transcripts [9] or isoforms. RNA molecules are all produced by this process, but only some are further used to encode proteins. RNA that encodes proteins is called messenger RNA, shortened as mRNA, and the transcripts are said to be coding. Regions of DNA that encode mRNA are said to be coding too.
The mRNA molecule is translated to protein in the ribosomes. Each consecutive group of three nucleotides encodes for one amino-acid. Translation beings on the start codon (ATG) and finished at a stop codon (TAA, TGA, TAG). Given translation is bounded by the start and stop codons, this process might skip nucleotides in either tips of the transcript. These ignored tips are called the “untranslated regions”, shortened as UTR. The translation of three nucleotides to an amino-acid obeys a fixed code, called the genetic code or central dogma (see Figure 3). The amino-acid have unique chemical properties and then chain folds, giving the protein its three dimensional shape. The shape and chemical properties determine the protein’s function [8].
Proteins are composed by 20 different amino-acid which form a single chain between hundreds to thousands of amino-acids long. Each amino-acid has a name and is identified by a single unique letter (see Figure 3).
In the case of eukaryote cells, DNA is a long lived molecule preserved inside the nucleus and mitochondria that is copied during mitosis and meiosis. But RNA and proteins are produced often, are short lived, and once they have fulfilled their purpose, they are broken down and recycled to form new RNA and proteins [8].
There are many genetic codes. The standard one (Figure 3) is the one relevant for humans and other animals, but there are other genetic codes that apply to mitochondrial DNA, or bacterial DNA [10].
2 miguelferig - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=15733806
Figure 3 – Genetic code3.
2.2. The immunogenic journey of peptides in a cell
During the protein recycling process, proteins are split into smaller peptides, with just a few amino-acids. Which exact peptide sequences are produced is not random as there is a higher probability of the protein being split at specific locations by proteolytic enzymes or the proteasome. In terms of the MHC Class I pathway, the peptides may be transported by a TAP1 or TAP2 molecule and then bound to a MHC Class I molecule. The MHC molecules are themselves proteins that have nigh variability between individuals and which variability affects which peptides they bind to. The MHC molecules expose the peptides on the cell surface. The MHC bound peptide (now themed “epitope”), if recognized by a T-cell, targets the cell for destruction. The whole process of breaking down the protein, transporting the peptide, binding to the MHC molecule, and transporting it to the cell surface represents a bottleneck in which each step decreases the probability of peptides being exposed [11, p. 198]. In the context of neoantigen discovery, one’s interested in peptides with lengths between 8 to 12 amino-acids which can bind to a MHC Class I molecule, or peptides between 9 and 26 amino-acids which can bind to a MHC Class II molecule. A peptide of a length k is generically called a k-mer, or in the case of a specific length, like 9, it is called a 9-mer.
3 Mikael Häggström – Public Domain https://commons.wikimedia.org/wiki/File:Notable_mutations.svg
The human leukocyte antigen system (HLA) is a system of genes responsible for the encoding of the human MHC molecules (major histocompatibility complex) [12, Ch. 5]. It plays a central role in immune function, as it is the main mechanism by which cells expose peptide fragments on the cell surface to be recognized by T-cells. This mechanism also exposes extra cellular pathogens by antigen presenting cells, such as B-cells. This means there is selective pressure by pathogens or cancer cells to evade detection by the MHC molecules.
Individuals have several HLA genes, located in chromosome 6. There are the HLA-A, HLA-B and HLA-C genes that encode the MHC Class I proteins; and there are the HLA-DQ, HLA-DR and HLA-DP genes that encode the MHC Class II proteins. Given chromosomes exist in pairs, each individual has 12 major HLA genes. There are also the minor HLA genes like HLA-F but their role in protein coded antigens are less relevant.
The HLA genes are highly polymorphic; their alleles (nucleotide sequences) have a lot of variability within a population which results in high variability of immune function [12]. If immune function is of interest for a given experiment, HLA typing becomes a mandatory step.
2.3. Cancer as a genomic disease and the generation of neoantigens in tumours
Cancer can be characterized as a spectrum of diseases characterized by the accumulation of mutations in the DNA of a group cells, which can grow to become tumours and metastases.
These mutations are called somatic mutations, as opposed to normal mutations inherited from
Figure 4 – MHC Class I antigen processing pathway (image courtesy of OncoImmunity).
the progenitors, called germline mutations in healthy cells. Somatic mutations can range from single DNA nucleotide changes to large structural rearrangement of chromosomes. [13]. When these mutations occur in protein coding genes, they can translate to mutations in the protein themselves. Somatic mutations can also affect regions that promote gene expression. Gene expression is the process by which genes and transcribed RNA. If gene expression is affected, proteins that were not expressed or expressed at low levels can become expressed in significant quantities. The opposite is also relevant. Somatic mutations can cause genes to become silenced.
Changes in gene expression directly affect the abundance of potential neoantigens.
Just like normal proteins, mutated proteins are also recycled and the resulting peptides are potentially exposed on the cell surface by the antigen processing and HLA mechanisms. If the unique mutated peptides resulting from somatic mutations are exposed, these can be used as a target to destroy the cancer cell. Cell selection and destruction due to somatic mutations is in fact a common occurrence. Cancer happens when the cells manage to circumvent at least one of the mechanisms that cause the immune system to detect the cancer cell [14].
Different cancer type have different mutation profiles. Lung cancer and melanoma are cancers that typically have a high number of mutations, while lymphoma has a small mutation profile. The mutation load is directly related with the probability of producing neoantigens. When that probability is low or zero an immunotherapy cannot be created.
Epitopes can originate from both healthy DNA and DNA with somatic mutations, called respectively self-antigens and neo antigens. While neoantigens represent a potential target for immune therapies, therapies that recognize self-antigens might trigger auto-immune responses instead.
These biological principles are what’s behind emerging immune-therapies, but simultaneously, they’re what make area of research quite challenging.
3. Bioinformatics background
3.1. Next Generation Sequencing
The current state-of-the-art methods for DNA sequencing are called Next Generation Sequencing (NGS) or High-Throughput Sequencing [15]. These names are umbrella terms for different methods and machines which share some common principles [16]. Given the chemical and physical processes involved, different machines perform differently regarding throughput, error rate and read length.
Many machines implement a sequencing-by-synthesis process, where the DNA strand being read is used as a template to produce a new strand, being the reverse complement of the first one. When a new nucleotide is assembled in the new segment, the sequencing machine detects that event, using either electrical current or light, and records the nucleotide.
Different machines and methods have different properties like:
• Throughput: the amount of data that the machine can produce, broken down into run time (the amount of time it takes to read a sample), reads per run (number of base pairs read per sample), and data per run (mega or gigabytes produced per sample).
• Error rate: dependent on the machine’s sequencing method, the error rate can have a uniform distribution or change over time. The values can be <1% or in some cases around 16%.
• Read length: the length of each DNA chunk being read, measured in nucleotides.
Lengths vary from around 50 to thousands.
Sequencing experiments are setup within a budget of money and time, and with specific objectives depending on the research question, e.g. sequencing a new organism or finding genetic variants in a healthy individual of a known sequenced species. A part of the setup is to decide on the best sequencing method and how much data needs to be produced. The data being produced depends on two main variables:
• Breadth of coverage: how much of the DNA sample needs to be sequenced, e.g. the whole sample/genome, just the exome/genes, just specific genes, etc.
• Depth of coverage, shortened as depth: how many times should each base pair be read on average. Low read depths are acceptable when the error rate for single nucleotides is not a problem: for instance, using long reads as scaffolds for genome assembly. High read depths are a requirement when base pair accuracy is important: for
instance, when finding variants in single nucleotides. When sequencing healthy human DNA, variants have an expected frequency of 50% or 100%, due to heterozygosity (chromosomes existing in pairs). Figure 5 illustrates the relation between error rate and read depth and how they affect variant detection. Assume A is a random read error with a 1% frequency (per 100
nucleotides) and uniformly distributed, and both B and C are two real heterozygous variants, respectively with a 50% and 100% expected frequency (per reads over the variant site). In this example, if the read depth is 2X (each nucleotide read on average 2 times), the error A can be confused with the real variants B and C. If the read depth increases to 20X, there are ten times more reads, and ten times more errors. But the errors occur randomly and are distributed, they don’t accumulate, while the variants B and C are represented in the expected frequency (50% and 100%) over the correct position.
The output of NGS sequencing machines are FASTQ files [17]. FASTQ files are plain text files which contain the individual reads. Each read is represented by an entry in the file with four lines. The following is an example of an entry produced by an Illumina sequencing machine [18]:
@HWI-ST193:542:C2H0GACXX:8:1101:4404:2179 1:Y:0:ACACGA ATGCNTTTTATAATCAAAAGCGAAGACCTAGCAGGAGGTTAAAAACCTTT +
<<<<#2<@@5:9@44:@@?4(-8@(<9@<<658.==5=0<>??????9??
The first line is the sequence identifier, with metadata regarding the source of the data:
machine type, flow cell/well coordinates, etc. The second line is the DNA read. The third line is a plus, optionally following by the same sequence identifier from line one. The fourth line are the Phred quality scores [17] for each nucleotide, calculated as 10 , where is the probability of the base call being incorrect, and the score is encoded as an ASCII character.
NGS sequencing methods can be used to sequence either DNA or RNA. Whether DNA or RNA is read depends on the initial sample preparation work, and not on the sequencing
Figure 5 – Read depth versus error rate.
method itself. DNA sequencing and RNA sequencing methods serve different purposes. While DNA sequencing serves to study the genome (genotype vs. phenotype, genetic diseases, trait inheritance), RNA sequencing experiments take snapshots of samples at a given point in time to study gene expression: how genes are being transcribed to RNA [19]. For instance, how do external stressors (infection and disease, tissue damage, tissue growth, epigenetics) affect gene expression?
Some of the most relevant NGS methods follow.
3.1.1. Dye sequencing
Dye sequencing is a sequencing-by-synthesis process, used by the Illumina sequencing machines [20]. Sample preparation is the first phase. A sample of DNA is prepared first by fragmenting the DNA in small chunks, typically around 50 to 300 nucleotides in length. To each fragment is attached a primer, a small chunk of DNA in both tips with a well know sequence and properties, which allows the fragment to be manipulated. Each chunk of DNA attaches to a physical object, like a small bead, while those beads are evenly distributed to avoid overlaps in the physical support, also called a flow cell. Through a process called polymerase chain reaction (PCR), which exploits DNAs built-in cloning property, each chunk of DNA is cloned roughly a million times. Each bead with its cloned DNA fragments represents a cluster. The prepared sample is called a library.
The next step is the reading itself, which happens inside a sequencing machine. A reagent is added to the flow cell, where the DNA clusters are. The reagent contains pseudo-nucleotides which react with the existing nucleotides. These pseudo-nucleotides will assemble to clone once more the DNA to read. This assembly happens sequentially, from one tip of the DNA segment to the other. The pseudo-nucleotide is chemiluminescent (the dye), and emits a photon when it attaches. There is a pseudo-nucleotide counterpart for each normal nucleotide, and each emits a photon of different wavelength. A camera picks up the photon and records the event. However, one photon alone is hard to detect. That’s why the DNA chunks are cloned so photon emission is amplified. This process of synthesizing a new DNA molecule with the chemiluminescent nucleotides happens simultaneously and synchronized for the million DNA chunks in a cluster, all emitting photons with the same colour simultaneously, amplifying the signal and making it much more reliable for the camera to detect [20]. However, this process can get out of sync after a couple hundred bases, when the cloning of the different DNA chunks in the same cluster can eventually diverge and happen at different offsets, causing photons for different nucleotides to be
emitted simultaneously and introducing error in the signal. The consequence is that machines that rely on this process, introduce errors in the end of the DNA chunks, and the errors are typically mismatches, meaning, nucleotides misidentified as being another nucleotide. But the error rates are low, typically <1% [5].
3.1.2. Single Molecule Real Time Sequencing
Single molecule real time sequencing is also a form of sequencing by synthesis, implemented by the Pacific Bioscience sequencing machines [21]. It starts also with sample preparation, where the DNA is split into chunks of several thousand nucleotides. There is no PCR amplification step. The sample is applied onto a small plaque to be read by the sequencing machine.
The plaque contains microscopic wells, each with a DNA polymerase molecule in its bottom. DNA polymerases are natural molecules that participate in the DNA cloning process. A chunk of DNA to read attaches to a DNA polymerase molecule in one well. A reagent mix is added to the plaque, which contains fluorescent nucleotides. The DNA polymerase conducts the cloning of the DNA chunk by attracting a fluorescent nucleotide into its cleft, clipping off the fluorescent tip and attaching the rest of the nucleotide to the new strand that is being synthetized.
At the bottom of the well, light is being shined, which reflects off the fluorescent bit of the nucleotide just attached. The wavelength of the reflection is used to identify the nucleotide that was just attached [22].
The advantage of this method, compared to sequencing by synthesis with PCR, is that because it relies on DNA polymerases, it can read chunks of DNA which are much longer, several thousands of base-pairs. These longer reads make genome assembly and structural variation detection much more reliable, as they are able to span long repeats, regions of low complexity (repeated nucleotides like CAGCAGCAG), or introduce lots of context around the split points of structural variants. But there are also disadvantages. Often, when a fluorescent nucleotide enters the well, it might not go deep enough for it to be visible when reflecting light or the single flash of light might be hard to detect; two fluorescent nucleotides might also react too quickly, making it hard to differentiate between two different events; DNA polymerases degrade over time. The consequence is that errors happen evenly during the reading process, typically deletes, or ambiguous inserts plus deletes around regions with repeated nucleotides (AAAA vs.
AAA). Errors rates currently are around ~16% and uniformly distributed [5].
3.1.3. Ion semiconductor sequencing
Ion semiconductor sequencing is a form of sequencing by synthesis, used by the Ion Torrent machines [23]. Like sequencing by synthesis, the sample is prepared by chunking the DNA, attaching to beads, doing the PCR step and then spread over a physical support. The physical support, a semiconductor chip, contains small wells where each beads lands.
Each well is flooded sequentially with a solution that contains only one of the four nucleotides (A, C, G, T). When a nucleotide binds to the DNA chunk, it releases a hydrogen ion (H+) which changes the pH of the solution. If there are many consecutive equal nucleotides in the DNA chunk, many nucleotides in the solution will bind, releasing more H+ ions causing a bigger change in pH. This change is pH is measured to detect how many nucleotides bound to the DNA chunk. The solution is flushed away, and a new solution is added with a new nucleotide, and the cycle repeats again.
This method does not use light to detect the nucleotides. Instead it uses an electrical current, making it more reliable than using a camera to detect light. Error rates are low, around 1% [5]. During very long repeats (more than several 7 repeated nucleotides) the solution can become saturated with H+ ions, making it hard to detect voltage differences between, for instance, a repeat of 8 or 9 nucleotides. The consequence is that errors are typically insertions and deletions, and uniformly distributed. Given this method does not use modified fluorescent nucleotides, it is also cheaper that competing machines that rely on that kind of chemistry. This method is also limited to short read lengths, less than 200 nucleotides.
3.1.4. Nanopore sequencing
Nanopore sequencing is the method implemented by the Minion machines, produced by Oxford Nanopore Technologies [24], among others. The Minion machine is a handheld device, which can be plugged easily to a computer using an USB port, unlike other machines which occupy table tops or rooms.
Nanopore sequencing requires minimal sample preparation, no DNA fragmentation, and nor PCR amplification. Quite simply, the sample is placed on a physical support which contains a microscopic pore, with a diameter in the nanometre scale. The DNA molecule is sucked from one tip to the other through the nanopore due to electrophoresis. While the DNA molecule is being passed through the nanopore, the latter registers changes in the magnetic field of the part of the molecule inside the nanopore which induces a current into two electrodes [25]. There are
also versions of nanopore sequencing which use fluorescence (but require sample preparation and amplification) or biological processes.
The nanopore is not small enough to read one nucleotide at a time. Instead, it reads 6 nucleotides simultaneously. The molecule meanwhile shifts by one nucleotide, and the pore reads more 6 nucleotides, 5 from before, and one new one. Lastly, software interprets the electric signals sent from the electrodes and determine which nucleotides were read.
The read length is arbitrarily long, like tens or hundreds of thousands of nucleotides.
However, the error rate is still high, 15% [26], distributed similarly between inserts, deletes and mismatches, because changes in the magnetic field as the molecule passes through the pore might be hard to disambiguate. Like in ion semiconductor sequencing, the nanopore has trouble identifying long repeated segments with the same nucleotide, above sizes of 6.
3.2. DNA read assembly and mapping
Consider the following output from a DNA sequencing experiment, where the first line in bold is the real DNA sequence, and the next 4 lines were the chunks detected by the machine.
TGGCGGGCTGCGCTTCCGAGGAGAGCTGCATG CGGGCTGCGCTTCCGAGGAGAG
CTGCGCTTCCGAGGAGAGCTGC GGCGGGCTTCGCTTCCGAG
CTGCGCATCCGAGG
3.2.1. De novo sequencing
In case the species has never been sequenced, the real DNA segment is not known a priori.
Therefore the DNA reads need to be assembled together to produce a consensus sequence which represents the original DNA as accurately as possible. This is done by finding overlaps between the DNA reads. There are several algorithms, the two most common ones being the Overlap- Consensus-Layout and de Bruijn graphs. Both methods lookup similar sub-segments in DNA reads which are treated as vertexes or edges to build a graph. Walking through all the vertexes or edges then yields the assembled sequence. Often, walking through the graph can be ambiguous and yield more than one path. This is the consequence of having short read lengths that start or end in regions with low complexity or many repetitions. Therefore, it is desirable for an assembly experiment to have reads as long as possible. As described before, the long read technologies
have high error rates, so assembly experiments can be complemented with short read high quality data to filter errors in individual nucleotides.
The output of a genome assembly is a FASTA file, like:
>segment1
GGCCAACTCATTGCCTGTGGTGTGCATGTGAC…
>segment2
ATCGCTGCATTTGAAAATCTTCAGGCCACCGTGCCAGTGATAAGTAGC…
Genome assembly is not sufficient to know which genes exist and where they are located.
Genes can be discovered using gene prediction algorithms or using other methods of sequencing that isolate individual chromosomes and/or genes [27].
Sequencing experiments that do not rely on a previously sequenced genome are called de novo or ab initio.
3.2.2. Standardized reference genomes
Different species have been sequenced to varying degrees of quality and their genomes have been built and are continuously improved by responsible research bodies, which release the genomes for public use. These genomes are called references. As sequencing methods become more accurate, and parts of the DNA that could not have been sequenced before are sequenced (e.g., regions with low complexity or long repetitions), the references improve over time.
The human reference genome is an obvious example of well characterized and publicly available reference genome. Currently, the main human reference genome is maintained and developed by the Genome Reference Consortium [28]. The latest major release of the human genome is called GRCh38 initially released in December 2013, and being continuously updated.
Other species with well know and characterized genomes are called model species, like the mouse (Mus musculus), the fruit fly (Drosophila melanogaster), or yeast (Saccharomyces cerevisiae) [29].
Along the main chromosomes, reference genomes also contain the list of genes and transcripts. There are many re-distributions of the reference genomes from the GRC by different institutions, which complement the reference with their own sets of genes and metadata. We’ll consider two main ones: RefSeq [30][31, Ch. 18] and Ensembl [32]. Both RefSeq and Ensembl include high confidence data, being based on existing literature and manually curated sequences, and include predicted data, based on automatic methods but waiting for empirical confirmation.
In both cases, the transcripts can be differentiated by their confidence and high level function, given their naming convention or metadata.
For RefSeq, the convention is as follows [31, p. 309]:
• Transcripts with name starting with NM from RefSeq are protein coding transcripts.
• Transcripts with name starting with NR from RefSeq are transcripts that encode non- coding RNA.
• The protein sequences themselves are named starting with NP.
• Transcripts starting with XM, XR and XP from RefSeq are transcripts which are generated by the eukaryotic genome annotation pipeline [33] lacking empirical confirmation.
• Gene names do not obey a fixed naming scheme.
• The names starting with NM, NR, XM, XR and XP are also called “accession numbers”.
For Ensembl, the convention is as follows:
• Gene IDs start with ENSG, transcripts IDs start with ENST, protein sequences start with
ENSP.
• For other species rather than human, the prefix is longer, like ENSMUSG, ENSMUST and EMSMUSP for the mouse.
• Further information whether the transcript is coding or not is given by the transcript biotype. The most common value is “protein_coding”. There are many other values like
“nonsense_mediated_decay” or “pseudogene”.
The reference genomes can also be formatted and stripped for specific use cases. For instance, for the sake of whole genome sequencing, the reference genome will contain the full sequences of all chromosomes. For the case of variant annotation, it’s most likely that only the exons and gene plus transcript coordinates are needed. The file formats are also standardized.
The FASTA file format [18] is the norm for the files containing DNA and protein sequences.
The GTF file is the file format containing coordinates plus metadata for genes and transcripts.
3.2.3. DNA sequencing with a reference
If a species has been sequenced previously and its genome or part of it has been assembled, this previous assembly can be used as a reference. The new DNA reads are mapped onto the previous reference assembly. Mapping is an important step of re-sequencing experiments, when
the most likely position of DNA reads in the reference is located, so the relative position of the reads is known, and a consensus sequence can be built.
In the DNA example in 3.2, two of the reads had two differences against the real DNA sequence, underlined. During the mapping process, the mapper program needs to tolerate mismatches and small insertions and deletions, in order to match the rest of the read. Once the position of a read is known, its differences against the reference are called the variants.
Re-sequencing experiments serve many purposes. These can be done in the context of a single individual, a group of close relatives (family), or a whole population (regional or global).
Re-sequencing helps explain how the genotype and its variations influence the phenotype (the visible characteristics); how genetic variations participate in disease; how genetic variants can be used for drug treatments and other therapies; forensics; to study biological evolution; etc.
The quality of the reference assembly influences how successful re-sequencing experiments are, e.g., if there is a high map rate of the reads and how much variation can be tolerated per read.
3.3. Variant calling
Re-sequencing experiments sequence individuals of species which genome is known a priori. In this context, comparing the reference genome and the newly sequenced genome is one of the main tasks that helps to characterize the target of the experiment, and it is more efficient as only the differences against the reference need to described.
Just like with de novo experiments, re-sequencing experiments also have to deal with the limitations of the input data, with noise, read length and read depth. Therefore, describing the differences between a reference and a newly sequenced target is also a complex problem.
Consider the example in 3.2 - there is a mutated T on the 3rd read. How does one know that is a real mutation and not error from the sequencing method? To answer this question, many competing solutions have been developed, which have their different use cases and there is not a one-size-fits-all approach.
The variant calling workflow has typically the following steps:
• Mapping the reads to the reference genome, finding their most likely position. Some reads might be mapped to more than one location, others might not be mapped at all.
The most popular algorithms for this step are all based on the Burrows-Wheeler transform (BWT) [34]. The BWT makes locating substrings of text (the DNA reads) in long blocks of text (in this case, the whole genome) quite efficient.
• Do a local sequence alignment, using an alignment algorithm like Needleman-Wunsch [35] or Smith-Waterman [36] to accurately calculate which are the nucleotides that match to the reference, which mismatch and which gaps need to be introduced.
Example:
a) TGGCGGGCTGCGCTTCCGAGGAGAGCTGCATG CGGGTGGGCTTCCGACCGAGAC
b) TGGCGGGCTGCGCTTCCGA-GGAGAGCTGCATG CGGG-TGGGCTTCCGACCGAGAC
• In the previous example, a) refers to the initial mapping with the reference in the first line and the mapped DNA read in the second. Although the position of the read was found, one can note that there is not a perfect match between both sequences. Example b) shows the effect of doing the local alignment, which calculates where gaps are placed and which mismatches (underlined) are accepted. Notice though that there is an ambiguity at the end of the read – accept a mismatch, or add a gap?
• Calculate the consensus sequence, by merging the mapped and aligned reads, and detect the differences against the reference. Due to heterozygosity it is expected that different variants can occur in the exact same site, which results in more than one consensus sequence and overlapping variants. Example:
TGGCGGGCTGCGCTT TGGTGCGCAGCGCTT TGGCGCGCAGCGCTT TGGCGCGCAGC-CTT TGGCGTGCAGC-CTT TGGCGTGCTGC-CTT 1 2 3 4
The reference is in bold and the variants are underlined. There one T in position 1, 3 Cs and 2 Ts in position 3, 4 As in position 3 and 3 gaps in position 4. Which ones are the true variants and not sequencing errors? Variant callers implement different heuristics to rank mutations against the error rate, like simple ratios as probability or using the Bayes theorem [37].
The output of the variant call is a VCF (Variant Call Format) file [38]. Example:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_1_name sample_2_name chr1 24383998 . A T 1453 PASS DP=12 GT 0/1 0/0
The fields CHROM (chr1) and POS (24383998) depend on the reference assembly used, which tell the position of the nucleotide(s) that changed in the reference. The field ID can optionally be filled by the variant caller and contains an identifier that links that variant to 3rd party databases in case of a previously know and documented variant, like frequency in populations and geographical distribution, or potential drug treatments in case it’s a variant that has impact in illness. The fields REF and ALT contain the original (from the reference) and modified nucleotides, respectively. The field QUAL is a numerical score given by the variant caller about the confidence of the variant call (bigger is better). FILTER, INFO and FORMAT contain extra metadata produced by the variant caller extracted from the reads that were used to find the variant and the variant call process. VCF stands for Variant Call Format and was introduced by the 1000 Genomes Project having meanwhile become a de-facto standard in the field.
Variants are named depending on the size differences between the REF and ALT:
• SNP: single nucleotide polymorphism, when one single nucleotide (REF) changes to another single nucleotide (ALT). Example: chr1 24383998 A T
• MNP: multi-nucleotide polymorphism, when both REF and ALT are longer than one nucleotide and of the same length. Example: chr1 24383998 AT TC
• Insertion: a clean insertion of nucleotides. REF is empty while ALT has the new nucleotides. Example: chr1 24383998 A ATGC
• Deletion: a clean deletion of nucleotides. ALT is empty while REF has the nucleotides from the reference. Example: chr1 24383998 ATCA A
• Substitution, or indel: insertion plus deletion, or a substitution of different size, when some REF nucleotides are replaced with new ALT nucleotides, and the length of REF and ALT are different. Example: chr1 24383998 ATTG ACC. The name indel is normally used to refer to all insertions and deletions.
The aforementioned steps of the variant call workflow are implemented by different programs, developed independently or as part of a bundle. The article “Tools for mapping high- throughput sequencing data” [39] lists 60 different mapping tools, among which: BWA (Burrows-
Wheeler Aligner) [40]; Bowtie [41] and Bowtie 2 [42]; or SOAP2 (Short Oligonucleotide Analysis Package) [43]. For the variant calling there are, for instance, LUMPY [44], VarScan [45], MuTect [46], Strelka [47], SomaticSniper [48], and others. The output of the read mapper is typically a SAM or BAM file [49] which is standardized and supported by all variant callers.
Choosing the read mapping and variant calling tools depends on the experiment being conducted, the type of input data and expected output. To be considered:
• Read length: the longer reads are, the less likely they are to be mapped. The read mapper should have heuristics to overcome this limitation, like only matching a seed/substring of the read and doing a local alignment around the read. If the read is small, a simple direct mapping is more likely to be successful, but can occur in more than once place.
• Read errors: the variant caller should take into account the error rate and distribution of errors, else errors can be confused as variants.
• Type of DNA sample: In healthy diploid samples (two chromosomes), the variant caller can expect frequencies of 50% or 100% for variants. But cancer samples have much more variability [13, pp. 85–86] and variants can occur with any frequency.
• Intersecting with previous data: is this an experiment that is comparing two samples (e.g., healthy sample against cancer sample)? Is this an experiment that benefits from pulling metadata from 3rd party sources, like identifiers for normal variants in populations, or drug treatments in case of variants from diseases?
• Performance: for large amounts of input data, performance of the tool might become a critical issue, regarding CPU usage, memory consumption and disk space, which need to be within the experiment constraints.
• Haplotype reconstruction or Phasing: Reference genomes are homoploid (single chromosome) and read mapping alone does not tell which chromosome the read belongs to. Haplotype reconstruction [50] is the process of grouping variants per allele/chromosome, allowing further downstream work to know if variants occur in the same chromosome – being co-located - or different chromosomes. Haplotype information is stored in the VCF file as metadata for each variant, in the GT field of the FORMAT and sample columns.
3.4. Variant calling of somatic variants
Variant calling from germline DNA is a challenging problem, but variant calling of somatic mutations in cancer samples is further affected by more factors [13, pp. 85–86]:
• Heterogeneity: tumours are not a homogeneous mass of cells. Within a single tumour, different cells can have different mutations, and those subpopulations of cells can further mutate to produce more variation. This means that somatic variants can have a very low background frequency and be easily confused with the error rate. This also poses a challenge for immune therapies which might not be able to target all cells in the tumour.
• Purity: tumours contain normal healthy cells mixed with mutated cells, further reducing the frequency of somatic variants.
• Ploidy: while normal human cells contain 23 pairs of chromosomes, tumour cells can contain partial or whole copies of chromosomes, or chunks of chromosomes missing.
These cloned chunks of chromosomes will cause DNA in those regions to be amplified, hindering the read depth in other regions.
• Sample quality: the pre-sequencing sample work needed to handle tumour samples and extract DNA, like the use of formalin, can cause non-reproducible changes or damage the DNA, reducing the amount of DNA that can be read or its quality.
The performance of somatic variant callers is a common theme in the literature because of the aforementioned issues. So far, it is understood that variant callers perform very differently from each other, producing different sets of variants given the same input data. This poses a serious confidence problem, which leads to another consideration: use one tool or many and produce an intersection? It also means that as it stands the current tools are unreliable to be used in critical settings [13, p. 90], [51]–[55]. This problem is so challenging that there is the DREAM challenge [56], which is a competition of teams developing their variant callers targeting controlled benchmarks and test suites developed specifically for that competition.
Figure 6 contains a Venn diagram from a review [51] intersecting the results from different variant callers with short DNA reads from chronic myeloid leukaemia samples, produced, produced by the Illumina HiSeq. Out of the universe of variants produced, there’s only a small intersect, which either points to many false positives (the variants called by just one tool) or many true negatives (the variants that some tools failed to detect). While the intersect might produce a set of high confidence variants, it’s nonetheless remarkable that tools perform so wildly different.
3.5. Variant annotation
The variant annotation and interpretation step is of special important for neoantigen discovery. The variant annotation step starts with the variants (the VCF file) and locates the genomic regions and transcripts affected by each variant. For genomic regions outside genes, the annotation tool might report changes in regulatory regions and motifs that affect gene expression, for instance. For genes and transcripts, the tools will report nucleotide changes, which can have effects on the translation of RNA to protein.
Variants are said to be “exonic” if they affect the exons in genes, “intergenic” if they do not affect genes, “intronic” if they occur in introns, and “splicing” if they affect intron/exon splice sites. Furthermore, variants in exons can affect the coding part of the mRNA molecule, which translates to protein, or affect the untranslatable regions (UTRs).
Nucleotide polymorphisms, changes in one or more nucleotides, without removing or adding nucleotides, can be classified according to their effects in the protein sequences affected.
Synonymous mutations are DNA mutations which have no effect in the translated protein sequence, because the modified codons produce the same amino-acids as the original ones. There are several codons in the genetic code which share the same amino-acid (see Figure 3). When the original and mutated codons encode for different amino-acids then amino-acid changes are produced. If one of the original or mutated codons are a stop codon (TAA, TGA, TAG), the mutation is a “stop loss” or “stop gain”, respectively. A “stop loss” causes translation of the protein to continue into the 3’UTR region by changing the original stop codon into a regular amino-acid. Translation might then hit a new stop codon, or the end of the mRNA molecule.
Figure 6 – Overlap of variants produced by different variant callers [51, p. 2226].
The translation of a variant that introduces a stop codon before the original stop codon is a “stop gain” and the protein sequence is truncated.
Insertions, deletions and indels can come in two flavours: multiples of three nucleotides or not. Changes of three nucleotides will cause full codons to be added to or removed from the mRNA, which translate as regular amino-acid insertions and/or deletions. The inserted or removed codon can also encode a stop codon, respectively, “stop gain” and “stop loss”. Changes not multiples of three nucleotides cause what is referred as a frame shift. Translation happens in groups of three codons so any change in nucleotides that is not a multiple of three shifts the translation frame by one or two nucleotides, so all nucleotides further downstream are also shifted. As example, consider a DNA sequence AAA CCC, which encodes the peptide KP and the following possible mutations:
Mutated allele Mutation Protein mutation Consequence Peptide
AAA TCC c.4C>T p.P2S missense KS
AAA CCA c.6C>A p.P2= synonymous KP
AAA TTT CCC c.3_4insTTT p.K1_P2insF inframe insertion KFP
ACC c.2_4del p.K1_P2delinsT inframe deletion T
ATT TTC C-- c.2_5delinsTTTT p.K1_P2delinsIFX frame shift IFX
TGA CCC c.1_2delinsTG p.K1* stop gained *P
Nucleotide mutations can also happen over the start codon, causing a full loss of the transcript, or over exon splicing sites, causing changing in splicing, which further complicates understanding how “alternate splicing” works.
Substitutions can affect more than one consecutive nucleotide, and be combined with indels, causing more complex mixes of amino-acid changes. If a mutation has downstream consequences (a “stop lost” or “frame shift”), any further downstream can too be affected, including mutations in the 3’ UTR given the stop codon might be affected by a frame shift.
The specific nucleotide and protein changes can be described using the HGVS standard nomenclature [57]. For instance:
• DNA changes are described in a format like c.[change]. The format for [change]
can be paraphrased as [number][AGCT]>[ACGT] in the case of SNPs; the format is similar to [range_range]{ins,del,delins,dup}[ACGT]+ for more complex variables, that span more than one nucleotide.
• Protein changes as described as p.[change]. The format for [change] can be paraphrased as[amino_acids][position][amino_acids], in the simple missense SNP case. More complex example include a peptide range and flanking peptides, like
p.A10_C12delinsEET. A synonymous SNP example is p.C10=.
These two descriptions are simplifications just to illustrate what the HGVS standard produces and are not exhaustive nor entirely correct.
To summarize:
• Variant can be said to have one or more consequences, different consequences have varying impacts on the protein.
• Variant annotations over splicing sites are hard to describe, as alternate splicing is not well defined.
• Combinations of variants located in the same allele or DNA region can produce combined effects, which can be quite interesting to explore, although the field of bio informatics has been focused in analysing individual variants.
The terms “synonymous”, “missense”, “stop gained”, “stop lost”, “frame shift”, “in-frame insertion”, “in-frame deletion”, among many others, are defined in sequence ontology [58] and are called “consequences” or “functional annotations”.
Consider the following example: the variant chr1 114713907 T A affects gene NRAS, with Ensembl ID ENSG00000213281. More precisely, it affect the transcript ENST00000369535, by changing the codon CAA at position 435 to CAT, which encode respectively the amino-acids Q and H. This causes the protein around that site to change from [DILDTAGQEEYSAMR] to [DILDTAGHEEYSAMR]. These changes have impact in the chemical and physical properties of the protein, which can affect its function. It also means that when the protein breaks down into peptides, the peptides that contained the Q amino-acid originally, now contain H, for instance, [TAGHEEYSA]. With this information, the list of peptides around the mutation can be built for any given peptide lengths.
Continuing the example, generating a list of 3-mers and 5-mers, peptides of length 3 and 5:
DILDTAGHEEYSAMR AGH
GHE HEE DTAGH TAGHE AGHEE GHEEY HEEYS
The list of mutated peptides in the context of neoantigen discovery poses several further research questions.
• Is this peptide novel, never seen before in this individual or organism?
• Can it be transported by the TAP molecule and after bound to the MHC molecule? Can it be recognized by a T-cell ?
• Is it sufficiently abundant? Answering this question relies in a different experiment, using RNA-seq.
There are several competing tools which do variant annotation. Following is a list of a few selected tools.
3.5.1. VEP – Variant Effect Predictor
VEP [59] is the tool produced and maintained by the Ensemble project [32]. VEP is being actively developed, and has a modular architecture, which allows VEP to run in different scenarios, like as a standalone local tool or as a web service. VEP supports both the Ensembl and RefSeq reference transcripts. Users can either use VEP online or download and use it as a local tool. VEP is extensible with plugins, which can modify the output data.
3.5.2. ANNOVAR
ANNOVAR [60] is a tool developed at Columbia University, free for academic and non- profit use. It supports primarily the RefSeq reference genomes for variant annotation, although the Ensemble reference genomes can be converted to ANNOVAR’s format. Furthermore, it