
Machine learning is a subfield of artificial intelligence, in which a computer is given the ability to learn without being explicitly programmed [136]. By feeding a machine learning method multiple examples (training data), it is able to learn from the data and generate rules that can be applied to predict outcomes for new samples [137].

For the purpose of multivariate analysis, with $n$ samples measured over $p$ variables, it is common to condense the data into an $n \times p$ matrix $\mathbf{X}$, where each row represents one sample and each column one variable (metabolite). Multivariate statistical methods can be subdivided into two types: unsupervised and supervised methods. Unsupervised methods aim to describe naturally occurring patterns in the data, without any knowledge of the response variable. Supervised methods make use of the labeling to create models, which can be used for predicting the outcome for unobserved data.

NMR data are highly collinear, making standard statistical methods unsuitable. In addition, the number of variables ($p$) often exceeds the number of samples ($n$). Multivariate methods based on latent variables are therefore commonly used for analyzing NMR data.

Principal component analysis (PCA)

Principal component analysis is an unsupervised statistical procedure, which uses an orthogonal transformation to convert a set of observations of correlated variables into a set of linearly uncorrelated variables, called principal components (PCs). PCs are linear combinations of the original variables, where the first principal component, PC1, is chosen in the direction along which the samples show the largest variation [138], as illustrated in Figure 1.12. The second principal component is orthogonal (thus uncorrelated) to the first principal component, in the direction of the greatest remaining variation. In this way, the dimension of the original data set is reduced, and most of the variation in the data is explained by a few new variables (PCs). When the number of variables is larger than the number of samples and there is a high degree of collinearity, PCA may greatly reduce the dimensionality of the data with little loss of information.

Mathematically, the decomposition of $\mathbf{X}$ by PCA results in $\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}$, where $\mathbf{T}$ and $\mathbf{P}$ are the score and loading matrices, respectively, and $\mathbf{E}$ is a matrix of residual variance which is not part of the model [138]. Here, the superscript $\mathrm{T}$ indicates the matrix transpose. PCA has only one tuning parameter, the number of principal components. This is directly related to the residual variance matrix $\mathbf{E}$: as the number of PCs included in the model increases, the residual variance decreases. The optimal number of PCs can be determined from a scree plot, where the explained variance is plotted against the number of principal components in the model. Typically, the curve is steep at first, then has a bend, followed by a flat part. The number of PCs associated with the bend should be chosen. As PCA is an unsupervised method, it is often used as an exploratory tool for detecting underlying patterns and outliers.
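As an illustration of this decomposition and the scree-plot criterion, the short Python sketch below fits a PCA to a simulated data matrix using scikit-learn; the data and the number of components are placeholders chosen purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated n x p data matrix (placeholder for a real NMR spectral matrix X)
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.normal(size=(n, p)) + rng.normal(size=(n, 1)) * rng.normal(size=(1, p))

# Mean-center (and optionally scale) the variables before PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with a generous number of components
pca = PCA(n_components=10)
T = pca.fit_transform(X_scaled)      # score matrix T (n x k)
P = pca.components_.T                # loading matrix P (p x k)

# Residual matrix E = X - T P^T for the chosen number of components
E = X_scaled - T @ P.T

# Scree-plot data: explained variance per principal component
for k, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{k}: {100 * ratio:.1f} % of the variance explained")
```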

The results of a PCA are displayed in a score plot in the new coordinate system defined by the PCs. This is a scatter plot in which each sample is projected onto the new coordinate system and represented by a point. By coloring the points by groups of interest, the degree of separation between the groups can easily be observed, and any underlying patterns in the data can be revealed.

Complementary to the score plot is the loading plot, which shows the contribution (weight) of each original variable to the PCs. The scores and loadings can alternatively be plotted together in a biplot, giving a rapid overview of the correlation between sample scores and variable loadings.
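As a minimal sketch of how score and loading plots could be drawn, the following Python snippet uses scikit-learn and matplotlib on a small simulated two-group data set; the group labels and variable names are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Small synthetic example: two groups of samples over a handful of variables
rng = np.random.default_rng(1)
n_per_group, p = 20, 8
X = np.vstack([rng.normal(0.0, 1.0, size=(n_per_group, p)),
               rng.normal(1.0, 1.0, size=(n_per_group, p))])
groups = np.array(["group A"] * n_per_group + ["group B"] * n_per_group)

pca = PCA(n_components=2)
T = pca.fit_transform(X)        # scores
P = pca.components_.T           # loadings

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Score plot: samples in the PC1/PC2 plane, colored by group
for g in np.unique(groups):
    ax1.scatter(T[groups == g, 0], T[groups == g, 1], label=g)
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_title("Score plot")
ax1.legend()

# Loading plot: the weight of each original variable on PC1 and PC2
ax2.scatter(P[:, 0], P[:, 1])
for j in range(p):
    ax2.annotate(f"var{j + 1}", (P[j, 0], P[j, 1]))
ax2.set_xlabel("PC1 loadings")
ax2.set_ylabel("PC2 loadings")
ax2.set_title("Loading plot")

plt.tight_layout()
plt.show()
```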


Figure 1.12 Principal component analysis. The figure to the left shows data in the original 3-dimensional space, where points corresponding to two different groups are colored in different colors. The figure to the right shows the same points projected down to a 2-dimensional space, where the highest variance is along PC1.

Partial least squares discriminant analysis (PLS-DA)

Partial least squares discriminant analysis (PLS-DA) is a supervised statistical procedure with many similarities to PCA. PLS-DA, however, combines dimensionality reduction and discriminant analysis in one algorithm [139]. The original data matrix $\mathbf{X}$ is decomposed into a set of new orthogonal variables, called latent variables (LVs).

The general underlying model for a PLS-DA can be written as:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}$$

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^{\mathrm{T}} + \mathbf{F}$$

where $\mathbf{X}$ is the original data matrix and $\mathbf{Y}$ is the response matrix with class memberships. Further, $\mathbf{T}$ and $\mathbf{U}$ are the score matrices for $\mathbf{X}$ and $\mathbf{Y}$, respectively, $\mathbf{P}$ and $\mathbf{Q}$ are the corresponding loading matrices, and $\mathbf{E}$ and $\mathbf{F}$ are the error matrices. $\mathbf{X}$ and $\mathbf{Y}$ are decomposed in such a way that the covariance between $\mathbf{T}$ and $\mathbf{U}$ is maximized. Thus, the first LV places the highest weights on the variables most strongly related to the response. There exist a number of algorithms for performing PLS-DA, of which SIMPLS [140] is the most commonly implemented in statistical software, as it is faster than the traditional non-linear iterative partial least squares (NIPALS) algorithm.

Model tuning and interpretation are similar to those of PCA. As PLS-DA is a supervised method, the number of LVs should be chosen based on the prediction accuracy estimated through cross-validation. As for PCA, the results of a PLS-DA model can be visualized by score and loading plots. The score plot shows each sample projected onto the new coordinate system spanned by the latent variables, and the loading plot shows the weight of each original variable on the scores.
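A minimal sketch of this tuning step, again assuming the dummy-coded PLS approach and simulated data, could look as follows.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold

# Hypothetical data: replace X and y with a real spectral matrix and class labels
rng = np.random.default_rng(3)
n, p = 30, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :5] += y[:, None]

# Choose the number of latent variables by cross-validated prediction accuracy
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for n_lv in range(1, 6):
    y_cv = cross_val_predict(PLSRegression(n_components=n_lv), X, y, cv=cv).ravel()
    accuracy = np.mean((y_cv > 0.5).astype(int) == y)
    print(f"{n_lv} LV(s): cross-validated accuracy = {accuracy:.2f}")
```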

PLS-DA can be extended through orthogonal projections to latent structures (OPLS-DA). This separates the predictive from the non-predictive variation in cases where there is more than one LV in the model. The extension improves the interpretability of the model, but not its predictive performance [141].

Multilevel PLS-DA

Multilevel PLS-DA is an extension of PLS-DA, effective for paired data structures in a multilevel study [142]. The multilevel approach consists of two steps. First, the variation between individuals is separated from the variation within individuals. Second, PLS-DA is performed on the within-subject variation.

Let $\mathbf{T}_1$ be a matrix containing all observations from sampling point 1, and similarly $\mathbf{T}_2$ a matrix containing all observations from sampling point 2. The between-subject variation, $\mathbf{M}$, in the multilevel model is then defined by

$$\mathbf{M} = \begin{bmatrix} \mathbf{T}_1 + \mathbf{T}_2 \\ \mathbf{T}_1 + \mathbf{T}_2 \end{bmatrix},$$

and thus represents the average of the two sampling time points. The within-subject variation, $\mathbf{W}$, is defined by

$$\mathbf{W} = \begin{bmatrix} \mathbf{T}_1 - \mathbf{T}_2 \\ \mathbf{T}_2 - \mathbf{T}_1 \end{bmatrix},$$

and is the net difference between the two sampling time points.

The second step in the multilevel approach is to analyze the within- and between-subject variation separately using PLS-DA. This is an effective tool for longitudinal data, where there are two or more multivariate measurements per subject. The approach can also be used for evaluating a treatment intervention, where $\mathbf{T}_1$ would then contain observations from the treatment group and $\mathbf{T}_2$ from the control group. Exploiting the paired data structure provides complementary information about the diversity and abundance of the treatment effect within different subgroups across the study population, as well as about the intrinsic differences between the individuals in the study.
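The two-step split can be written out in a few lines of Python; the sketch below assumes the paired observations are stored in two arrays, T1 and T2, with one row per subject in matching order, and both the data and these names are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Placeholder paired data: one row per subject at sampling points 1 and 2
rng = np.random.default_rng(4)
n_subjects, p = 20, 50
T1 = rng.normal(size=(n_subjects, p))
T2 = T1 + 0.5 + rng.normal(scale=0.2, size=(n_subjects, p))  # systematic shift between time points

# Between-subject variation M and within-subject variation W
M = np.vstack([T1 + T2, T1 + T2])
W = np.vstack([T1 - T2, T2 - T1])

# Class labels for the within-subject PLS-DA: which time point each row of W represents
y = np.repeat([0, 1], n_subjects)

# PLS-DA (dummy-coded PLS regression) on the within-subject variation
pls = PLSRegression(n_components=2)
pls.fit(W, y)
print("Within-subject score matrix shape:", pls.transform(W).shape)
```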

Random forest

A decision tree is built by binary splitting of the decision space following a top-down approach. At each step of the algorithm, a split of the decision space is made to maximize the homogeneity of the two new parts. This is best illustrated by an example. Suppose we want to predict the body mass index (BMI) of an individual, given only the waist and hip circumferences of that individual. The data are shown in Figure 1.13, where the points are colored according to the associated BMI category, as defined by the World Health Organization (WHO) [143].

Figure 1.14 shows the process of fitting a classification tree to the data. The top split assigns individuals having hip circumference < 97.5 cm to the left branch. The predicted BMI category for these individuals is the majority vote of the BMIs of all points enclosed in this space, namely underweight.

Thus, if the algorithm were stopped here, all individuals with a hip circumference < 97.5 cm would be classified as underweight, and all those above as having a normal weight. The second split is at hip circumference = 112.5 cm. Now, in addition to the previous split, all individuals with a hip circumference above 112.5 cm are classified as pre-obese. This splitting is repeated until a stopping criterion is met. The bottom of Figure 1.14 (bottom left) shows the final classification tree fit to the data, and the associated partitions of the predictor space. Each rectangle in the predictor space is associated with a terminal node, and the points along the tree where the predictor space is split are referred to as internal nodes. The lines connecting the nodes are called branches. To make a new prediction, one starts at the top of the tree and follows the criteria met at each internal node until a terminal node is reached. This example showed a classification tree; however, it could easily be transformed into a regression tree by replacing the BMI categories with the exact BMI values. In that case, the predicted value at each terminal node would simply be the mean value of all observations in the corresponding space. A decision tree is intuitive and easily interpretable, but its performance is often worse than that of other regression and classification approaches. Trees are also non-robust, meaning that a small change in the data can cause a large change in the final estimated tree.
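A tree of this kind can be fit with scikit-learn's DecisionTreeClassifier, as sketched below on simulated waist and hip circumferences; the data and the resulting split points are illustrative only and will not reproduce the exact tree in Figure 1.14.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated waist/hip circumferences (cm) and BMI categories; purely illustrative
rng = np.random.default_rng(5)
n = 200
waist = rng.uniform(60, 120, size=n)
hip = rng.uniform(80, 130, size=n)
bmi = 0.5 * waist + 0.3 * hip - 40 + rng.normal(scale=2, size=n)
categories = np.select([bmi < 18.5, bmi < 25, bmi < 30],
                       ["underweight", "normal", "pre-obese"], default="obese")

X = np.column_stack([waist, hip])

# Fit a small classification tree on the two predictors
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, categories)

# Print the sequence of binary splits (internal nodes) and terminal nodes (leaves)
print(export_text(tree, feature_names=["WaistCirc", "HipCirc"]))

# Predict the BMI category for a new individual
print(tree.predict([[85.0, 100.0]]))
```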

Figure 1.13 The decision space for BMI as a function of the variables WaistCirc and HipCirc, which are waist and hip circumferences of some individuals, respectively.

By combining multiple decision trees, the performance can be much improved. A random forest (RF) is an ensemble of $B$ decision trees, where each tree in the forest is constructed on a bootstrapped sample of the original data [133, 144, 145]. The bootstrapped sample is created by sampling with replacement from the original data [146]. This approach largely reduces the variance and increases the robustness of the model. Another difference between a decision tree and a random forest is that, when growing a tree in the random forest, only a random subset, $m$, of the $p$ predictor variables can be considered at each split of the tree.

The objective of this is to break up the correlation between the trees in the forest and to prevent strong predictors from dominating the model. Once the random forest is grown, the final prediction for a new observation is the majority vote or the mean value of the individual trees, for classification and regression problems, respectively. A random forest has two tuning parameters. The size of the tree ensemble, $B$, must be sufficiently large for the error rate to reach a stable minimum; however, the model will not overfit if $B$ is increased [133]. The number of predictor variables that can be considered at each split must also be chosen. It is typically set to $m = \sqrt{p}$ or $m = p/3$ when building a random forest of classification or regression trees, respectively. However, the best choice depends on the problem, and a small value of $m$ is typically beneficial when there is a large number of correlated predictors in the data [145]. A disadvantage of RFs is that interpretability is lost in exchange for increased prediction accuracy, compared to single decision trees. An overall summary of the importance of each predictor can, however, be obtained using the Gini index or the residual sum of squares (RSS) for classification and regression problems, respectively, and is often represented in a bar plot. The total amount by which the Gini index or RSS is decreased by splits over a given predictor variable, averaged over all $B$ trees, can be recorded; a high value indicates an important predictor.
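The sketch below shows how such a forest could be grown with scikit-learn on simulated data and how the Gini-based variable importances can be extracted; the parameter values are placeholders, and max_features="sqrt" corresponds to the choice $m = \sqrt{p}$ discussed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulated classification data: only the first three variables are informative
rng = np.random.default_rng(6)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Random forest of B = 500 trees, considering m = sqrt(p) predictors at each split
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Gini-based variable importance, averaged over all trees in the forest
importance = rf.feature_importances_
for j in np.argsort(importance)[::-1][:5]:
    print(f"variable {j}: importance = {importance[j]:.3f}")
```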


Figure 1.14 The process of fitting a classification tree to the decision space in Figure 1.13.

Gradient boosting machines

Similarly to a random forest, a gradient boosting machine is an ensemble of decision trees. However, boosting does not involve bootstrap sampling; instead, trees are grown sequentially, where each tree is fit to a modified version of the original data set [133, 145, 147, 148]. Given a current model, a decision tree is fit to the residuals of that model. This new tree is added to the existing tree ensemble, and the residuals are updated. The model is thus improved in the areas where it performs poorly. The boosting approach is therefore to learn slowly, with each individual tree being rather small. There are different variants of boosting algorithms, which will not be discussed in detail here [136].
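The core idea of repeatedly fitting a small tree to the residuals of the current model can be spelled out directly; the toy regression loop below is a bare-bones sketch of this idea, with an assumed shrinkage of 0.1 and tree stumps, and is not a full gradient boosting implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

n_trees, shrinkage = 100, 0.1
prediction = np.zeros_like(y)          # the current model starts at zero
ensemble = []

for _ in range(n_trees):
    residuals = y - prediction                   # what the current model gets wrong
    stump = DecisionTreeRegressor(max_depth=1)   # a small tree (stump)
    stump.fit(X, residuals)                      # fit the new tree to the residuals
    prediction += shrinkage * stump.predict(X)   # add a shrunken version of the new tree
    ensemble.append(stump)

print("Training MSE:", np.mean((y - prediction) ** 2))
```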

Depending on the exact implementation of the algorithm, there can be different tuning parameters.

Importantly, the distribution function must be chosen as Gaussian or binomial, depending on whether it is a regression or a classification problem, respectively. The size of the individual trees must also be set, governed by the interaction depth, $d$. A model with only tree stumps ($d = 1$) often outperforms models with higher interaction depths, provided the number of trees is sufficiently large. To make the algorithm learn even more slowly, a shrinkage parameter, $\lambda$, can be set [148]. By setting $\lambda$ below one, a shrunken version of the new tree is added to the tree ensemble at each iteration of the algorithm. It is typically set to 0.01 or 0.001, depending on the problem. A very small $\lambda$ might require a very large ensemble size to achieve good performance. Another important tuning parameter is the number of trees in the ensemble, $B$. This number is typically considerably higher than for a random forest, as each individual tree is very small. These tuning parameters can be chosen during the training process by cross-validation. Similarly to RFs, the relative variable influence in the model is given as an output.
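As an illustration of how these tuning parameters could be chosen by cross-validation, the snippet below uses scikit-learn's GradientBoostingClassifier, in which the shrinkage parameter is called learning_rate; the data and grid values are arbitrary placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Simulated binary classification data
rng = np.random.default_rng(8)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

# Grid over interaction depth d, shrinkage (learning rate) and ensemble size B
grid = {
    "max_depth": [1, 2, 3],            # d = 1 corresponds to tree stumps
    "learning_rate": [0.01, 0.001],    # shrinkage parameter lambda
    "n_estimators": [500, 1000],       # number of trees B
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
```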