

8. Methodological considerations

8.3 Bioinformatics and machine learning for in silico prediction models (Papers I, II, III)

Machine learning is an umbrella term for methods by which machines (computers) learn from data. It includes supervised learning such as classification, regression and ranking, unsupervised learning such as clustering and principal components analysis, and reinforcement learning (178). Neural networks can be used for supervised learning, but unlike linear and logistic regression or other classical models, a neural network is not told in advance what shape the data should have (which function it should follow). It learns the shape of the function from the data itself, an approach often termed deep learning. As nature is complex, the need for complex models in medicine is immense, and deep learning neural networks provide new ways forward (179, 180). Examples of use include classification and pattern recognition of images in radiology (181) and, more relevant for my topic, prediction of epitopes in immunology (a comprehensive list of examples can be found in (182)).

Recently, powerful and open frameworks utilizing mainstream graphics processing units (e.g. Google's TensorFlow, https://www.tensorflow.org/) have become publicly available, allowing anyone with sufficient knowledge to train advanced neural networks in their office or at home.
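The idea of a network "teaching itself" the shape of a function can be illustrated with a minimal sketch: a single-hidden-layer network trained by stochastic gradient descent to fit y = sin(x). This is plain Python for illustration only; all hyperparameters are arbitrary, and frameworks such as TensorFlow automate and scale exactly this kind of computation.

```python
import math, random

random.seed(0)

# Synthetic training data: samples of y = sin(x) on [-3, 3].
data = [(x / 10.0, math.sin(x / 10.0)) for x in range(-30, 31)]

H = 8  # number of hidden units
w1 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b2 = 0.0

def forward(x):
    """One hidden tanh layer, linear output."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return h, sum(w2[j] * h[j] for j in range(H)) + b2

def mse():
    return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

loss_before = mse()
lr = 0.01
for _ in range(1000):
    for x, y in data:
        h, pred = forward(x)
        err = 2 * (pred - y)  # d(squared error)/d(pred)
        for j in range(H):
            # Backpropagation: chain rule through tanh (derivative 1 - h^2).
            dh = err * w2[j] * (1 - h[j] ** 2)
            w2[j] -= lr * err * h[j]
            b1[j] -= lr * dh
            w1[j] -= lr * dh * x
        b2 -= lr * err
loss_after = mse()
```

After training, the mean squared error is lower than at the random starting point; the network has inferred the shape of the sine curve from the data alone, without being given the functional form.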

8.3.1 Neural network prediction of HLA affinity

The predicted peptide-HLA affinity is an important tool for limiting the selection of epitopes to assess in vitro to the most relevant candidates. As briefly mentioned above, several tools for predicting epitopes exist, and all include prediction of HLA affinity using neural networks or other predictive models. For this thesis we collaborated with Robert D. Bremel and E. Jane Homan of ioGenetics LLC. They had developed an HLA affinity prediction model utilizing principal component analysis of amino acid physical property data (Figure 11), with known affinities expressed as ln(IC50)18 for input (183, 184).

The affinities were procured from the Immune Epitope Database (IEDB) (185). In their analysis the first three principal components explained 90% of the variance and were used for further analysis. The first component correlated with hydrophobicity or polarity, the second with size and the third possibly with electric charge (184). As their system was built using commercially available software (JMP©), predictions could be performed on desktop computers instead of supercomputers. However, the procedure is not limited to this platform and may be utilized in other frameworks as well.
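The principle behind reducing many correlated property scores to a few principal components can be sketched on a toy scale. The two property values per amino acid below are invented for illustration (they are not the 31 real descriptors used by Bremel and Homan); for a 2×2 covariance matrix the eigenvalues, and hence the explained variance, can be written in closed form.

```python
import math

# Two hypothetical, correlated property scores for a few amino acids.
props = {
    "A": (1.8, 0.31), "L": (3.8, 0.74), "K": (-3.9, 0.65),
    "D": (-3.5, 0.40), "F": (2.8, 0.88), "G": (-0.4, 0.00),
    "W": (-0.9, 1.00), "S": (-0.8, 0.33),
}
xs = [p[0] for p in props.values()]
ys = [p[1] for p in props.values()]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Entries of the 2x2 covariance matrix (population covariance).
a = sum((x - mx) ** 2 for x in xs) / n
c = sum((y - my) ** 2 for y in ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]] in closed form.
mean_diag = (a + c) / 2
spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mean_diag + spread, mean_diag - spread

# Fraction of total variance captured by the first principal component.
explained = lam1 / (lam1 + lam2)
```

With 31 descriptors the same computation runs on a 31×31 covariance matrix, yielding the scree curve in Figure 11 where the first three components capture roughly 90% of the variance.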

18 Affinity for HLA molecules is typically reported as IC50 (half maximal inhibitory concentration), a quantitative measure indicating how much of the substance is needed to inhibit binding by 50% in competition binding assays. Lower values indicate higher affinity. Values used for training, and thus for output, were natural logarithms (ln) of IC50.


Figure 11. Principal components of amino acid physiochemical properties

All twenty amino acids have distinct physiochemical properties. Based on 31 data points for each amino acid, Bremel and Homan performed principal components analysis and found that the first 11 principal components (y-axis) explained 99% of the variability (x-axis), as shown in the left panel. The first three explain approximately 90%, and the right panel shows the relationship between the first two, which explain 83% of the variability. Both the affinity and cathepsin cleavage prediction models utilized the first three principal components as input values for training and predictions.

8.3.2 Neural network prediction of cathepsin cleavage

In protease cleavage assays, the eight amino acids surrounding a potential cleavage site, four before and four after, are typically labelled P4P3P2P1 | P1’P2’P3’P4’, where | marks the cleavage site in the cleavage site octamer and the numbers indicate the relative distance from it. Some enzymes have clear preferences for certain amino acids in certain positions: trypsin cleaves when lysine or arginine is in position P1 (186), while legumain prefers aspartic acid or asparagine in P1 (187). Cysteine cathepsins, however, are much less specific and rather promiscuous in their amino acid preferences (examples can be found in the MEROPS19 database (188)). Predicting these enzymes’ activity is thus more difficult than for specific enzymes and requires models that can be trained without fully knowing the enzymes’ substrate specificity.
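The octamer labelling above can be made concrete with a short sketch that scans a sequence for cleavage sites using the simple trypsin rule from the text (cleave after lysine or arginine); the function name and sample sequence are hypothetical, and real trypsin specificity has exceptions not modelled here.

```python
def cleavage_octamers(seq, p1_residues={"K", "R"}):
    """Return (position, octamer) pairs for predicted cleavage sites.

    Position i means cleavage between seq[i] and seq[i+1], i.e. the
    residue seq[i] occupies P1. The octamer is written as
    P4P3P2P1 | P1'P2'P3'P4' with '|' marking the cleavage site.
    """
    sites = []
    # Four residues are needed on each side of the cleavage site.
    for i in range(3, len(seq) - 4):
        if seq[i] in p1_residues:
            octamer = seq[i - 3:i + 1] + "|" + seq[i + 1:i + 5]
            sites.append((i, octamer))
    return sites
```

For example, `cleavage_octamers("MAGKRSTLKVAQWE")` yields three octamers, one per K/R that has four flanking residues on both sides. For promiscuous cysteine cathepsins no such one-residue rule exists, which is why a trainable model is needed instead.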

Similar to how the HLA-affinity prediction algorithms were developed, ioGenetics also developed a way to utilize principal component analysis of amino acid physical property data (Figure 11) to predict cathepsin cleavage probability (189), a method that was further validated in Paper II.

Importantly, and unlike the HLA-affinity models using continuous IC50 values, the input and output for this model were binary: cleaved or non-cleaved observations were used as input to estimate a probability between 0 and 1 that cleavage would occur. Instead of the typical 9-mers (for HLA-I) or 15-mers (for HLA-II) used as input peptides, the algorithm used cleavage site octamers. However, the actual neural network ensembles used for prediction were limited to assessing the cleavage dipeptide P1P1’ due to technical limitations.

19 MEROPS is a comprehensive database of peptidases and their activities.

The models were trained with data from proteomic identification of protease cleavage sites (PICS) assays, in which trypsin- or GluC-predigested proteome libraries were digested with cathepsin S, L or B and assessed using nLC-MS to identify cathepsin cleavage sites (190). As GluC and trypsin have dissimilar cleavage site preferences, bias introduced by pre-cleavage was reduced, but not eliminated, from the training set. Additionally, the input dataset did not cover every possible cleavage site octamer, or even all possible P1P1’ dipeptides, and the trained models could be limited by this as well.
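The reduction of octamer observations to the P1P1' dipeptide actually assessed by the ensembles can be sketched as follows. This is a frequency-count stand-in, not the neural network itself, and the octamer strings and cleaved/non-cleaved labels are invented for illustration.

```python
from collections import Counter

def p1_p1prime(octamer):
    """Extract the P1P1' dipeptide from a 'P4P3P2P1|P1'P2'P3'P4'' string."""
    left, right = octamer.split("|")
    return left[-1] + right[0]

# Hypothetical PICS-style observations: (octamer, 1 if cleaved else 0).
observations = [
    ("ALGS|VKTE", 1),
    ("TQPL|GAVR", 1),
    ("ALGS|VKTE", 0),  # the same octamer can appear uncleaved
    ("GGFS|VLLA", 0),  # a different octamer sharing the same P1P1'
]

# Count cleaved vs. total per dipeptide to estimate a cleavage probability.
cleaved, total = Counter(), Counter()
for octamer, label in observations:
    dp = p1_p1prime(octamer)
    total[dp] += 1
    cleaved[dp] += label

prob = {dp: cleaved[dp] / total[dp] for dp in total}
```

Note how two distinct octamers ("ALGS|VKTE" and "GGFS|VLLA") collapse onto the same dipeptide "SV": collapsing to P1P1' discards the flanking P4-P2 and P2'-P4' context, which is one consequence of the technical limitation described above.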

8.3.3 T cell exposed motifs

In the context of a peptide-HLA complex, only parts of the bound peptide will be “visible” to interacting T cells, while other parts remain “hidden” in the HLA groove. The exposed parts, the TCEM, differ between HLA-I and HLA-II bound peptides. Using data from Rudolph et al. (40) and Calis et al. (41), three patterns of TCEM were deduced by Bremel and Homan (39). TCEM I includes residues 4, 5, 6, 7 and 8 of a 9-mer in the context of HLA-I, while TCEM IIa and TCEM IIb include residues 2, 3, 5, 7, 8 and -1, 3, 5, 7, 8, respectively, of a 15-mer with a 9-mer core in the context of HLA-II (Figure 12). Importantly, TCEM IIa/b were constructed with data from presentation on HLA-DR, and interpretations in the context of other HLA types should be mindful of this.
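The three motif definitions amount to position masks over the bound peptide and can be sketched directly; the helper names and the sample peptides below are hypothetical, and positions follow the 1-based numbering used in the text.

```python
def tcem_i(ninemer):
    """TCEM I: residues 4-8 of a 9-mer presented on HLA-I."""
    assert len(ninemer) == 9
    return ninemer[3:8]  # 0-based slice for 1-based positions 4..8

def tcem_ii(fifteenmer, core_start, variant="IIa"):
    """TCEM IIa/IIb relative to the 9-mer core of a 15-mer (HLA-II).

    core_start is the 0-based index of the core within the 15-mer;
    position -1 denotes the residue immediately N-terminal of the core.
    """
    core = fifteenmer[core_start:core_start + 9]
    if variant == "IIa":
        positions = [2, 3, 5, 7, 8]  # 1-based positions within the core
        return "".join(core[i - 1] for i in positions)
    # IIb: position -1 plus core positions 3, 5, 7, 8.
    prev = fifteenmer[core_start - 1]
    return prev + "".join(core[i - 1] for i in [3, 5, 7, 8])

# Each motif is a (discontinuous) 5-mer, so there are 20**5 = 3,200,000
# possible motifs of each type.
```

Note that all three motifs are 5-mers but are discontinuous: the skipped positions are exactly the anchor residues buried in the HLA groove.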

Figure 12. T cell exposed motifs

“Two T cell exposed motifs (TCEM) in context of peptide:HLA-DR binding. TCEM IIa consists of amino acids 2,3,5,7,8 and TCEM IIb of -1,3,5,7,8 in a 9-mer core of 15-mers (-3,-2,-1,1,2,3,4,5,6,7,8,9,+1,+2,+3). The non-linear 5-mer TCEM are the deduced sequences T cell receptors (TCR) may interact with, as the other amino acid residues remain hidden in the HLA-groove. There are theoretically 3.2 million (20⁵) of each type.”

Figure and text from Paper III – Høglund et al., manuscript (2020).

The frequency of TCEM occurrences within the human proteome (UniProt without IGHV (191)), the human microbiome (from the National Institutes of Health Human Microbiome Project (192)) and two different IGHV databases (DeWitt et al. (9) and Johansen et al. (140)) was calculated and expressed as frequency classes (FCs), as first shown by Bremel and Homan (39). This is a reverse log2 scale, where FC 0 (1/2⁰ = 1) indicates that the motif occurs in every IGHV sequence, and FC 21 (1/2²¹) indicates that it occurs approximately once every 2 million IGHV sequences. For proteome and microbiome occurrences, a Johnson SI transformation of the log2 values was used.
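On this reverse log2 scale, the frequency class is simply the negative log2 of a motif's relative frequency; a minimal sketch (the function name is hypothetical, and the exact normalisation used in Papers I-III may differ):

```python
import math

def frequency_class(occurrences, total):
    """FC = -log2(occurrences / total sequences).

    FC 0 means the motif occurs in every sequence; each additional
    FC unit halves the frequency, so FC 21 corresponds to roughly
    one occurrence per 2 million (2**21) sequences.
    """
    return -math.log2(occurrences / total)
```

For instance, a motif found in every sequence gives FC 0, while one occurrence among 2²¹ (about 2.1 million) sequences gives FC 21, matching the range described in the text.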

Unlike previous calculations that utilized full-length IGHV sequences (39), the FCs in this work were calculated using databases containing only part of the IGHV sequence (9, 140), but with a much larger repertoire (see section 8.2). This extended the FC range to 0–22, as opposed to 0–16 previously.