SVM parameters - Imputation and classification of time series with missing data using machine l

SSI data

The data set was split into a training set with data from 618 patients, from which the DTW distances and the kernel functions that go into a kernel matrix of shape (618×618) are computed; a validation set with data from 133 patients, from which a kernel block matrix of shape (133×618) is computed; and a test set with data from 132 patients, from which another kernel block matrix of shape (132×618) is computed. A suitable gamma was selected using the median value of the DTW distances between patients in the training set for the given imputation method. The best value of C was found by grid search on the validation set, with the F1 score as the selection criterion. Additionally, a few stopping criteria were set: the SVM was allowed at most 10000 iterations to converge, a tolerance of 1E-3 between iterations was used for the validation set, and a tolerance of 1E-6 between iterations was used for the test set.
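The procedure above can be sketched roughly as follows. This is a minimal illustration and not the author's code: the Euclidean distance stands in for DTW, the data are random toy vectors, the kernel form exp(-gamma * d^2) with gamma = 1 / median(d)^2 is an assumption (the text only says gamma was derived from the median distance), and the grid of C values is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy stand-in for imputed patient data (hypothetical sizes, not 618/133).
X_train = rng.normal(size=(60, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_val = rng.normal(size=(20, 5))
y_val = (X_val[:, 0] > 0).astype(int)

def pairwise_dist(A, B):
    """Euclidean distances, used here only as a placeholder for DTW."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

D_train = pairwise_dist(X_train, X_train)  # shape (n_train, n_train)
D_val = pairwise_dist(X_val, X_train)      # shape (n_val, n_train)

# Assumption: gamma from the median training distance, Gaussian-type kernel.
off_diag = D_train[np.triu_indices_from(D_train, k=1)]
gamma = 1.0 / np.median(off_diag) ** 2
K_train = np.exp(-gamma * D_train ** 2)
K_val = np.exp(-gamma * D_val ** 2)

# Grid search over C, selecting by F1 score on the validation block.
C_grid = [0.01, 0.1, 1, 10, 100]  # hypothetical grid
best_C, best_f1 = None, -1.0
for C in C_grid:
    clf = SVC(kernel="precomputed", C=C, max_iter=10000, tol=1e-3)
    clf.fit(K_train, y_train)
    score = f1_score(y_val, clf.predict(K_val))
    if score > best_f1:
        best_C, best_f1 = C, score
```

Note that the validation block has one row per validation patient and one column per training patient, which matches the (133×618) shape described above.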

Synthetic missing data

The same procedure was used for the uwave data with synthetic missing data, with an added pre-processing step to counter a glitch with this particular data set that was discovered during preliminary experiments. Before splitting the distance matrix and the corresponding labels with the StratifiedShuffleSplit method, the label indices and the distance matrix values had to be shuffled in a synchronised manner. The training set had 783 indices, resulting in a kernel of shape (783×783), and the validation set had 168 label indices, giving a validation block of shape (168×783). The length and shape of the test set were the same as those of the validation set.
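A minimal sketch of such a synchronised shuffle followed by a stratified split is given below. The toy sizes, the four-class labels, and all variable names are assumptions made for illustration. The key point is that the same permutation is applied to the labels and to both the rows and the columns of the distance matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
n = 40
y = np.arange(n) % 4                      # toy labels with 4 classes
pts = rng.normal(size=(n, 3))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)  # toy distance matrix

# Synchronised shuffle: one permutation for labels, rows, and columns.
perm = rng.permutation(n)
y_s = y[perm]
D_s = D[np.ix_(perm, perm)]

# First split: training indices versus the rest.
outer = StratifiedShuffleSplit(n_splits=1, test_size=12, random_state=0)
train_idx, rest_idx = next(outer.split(np.zeros(n), y_s))

# Second split: divide the rest equally into validation and test.
inner = StratifiedShuffleSplit(n_splits=1, test_size=6, random_state=0)
val_pos, test_pos = next(inner.split(np.zeros(len(rest_idx)), y_s[rest_idx]))
val_idx, test_idx = rest_idx[val_pos], rest_idx[test_pos]

# Distance blocks: rows are the evaluated samples, columns the training samples.
D_train = D_s[np.ix_(train_idx, train_idx)]
D_val = D_s[np.ix_(val_idx, train_idx)]
D_test = D_s[np.ix_(test_idx, train_idx)]
```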

4.4.1 TCN parameters

SSI data

The data was augmented and divided into training data and test data by an 80/20 percentage split. The training data was split further, with 20% of the training set (16% of the full data set) used for validation by the Keras network training method. The learning rate of the optimizer was set to 0.0001; this was assumed to be sufficient, but there was no time to validate the choice. The dilations used were 1, 2, 4, 8, 16, and 32, in that order. The dropout rate was set to 0.1 for all layers. The class weight was set to 1 for class 0 (non-SSI) and to N_tr/N_{y=1} for class 1 (SSI), where N_{y=1} is the number of training labels marked class 1 and N_tr is the total number of training labels. The loss function used was the binary cross-entropy function. The early stopping parameters were set to a minimum ∆ (required improvement of the loss function) of 0.01 and a patience (how many epochs to wait) of 10, with the maximum number of epochs set to 250. Both validation loss and validation accuracy were monitored.
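As a small worked example of the split fractions and the class weighting described above, consider the following sketch. The label vector is hypothetical. A dictionary of this form is the kind of value that can be passed as the `class_weight` argument of the Keras `Model.fit` method, although the source does not show how the weights were actually supplied.

```python
import numpy as np

# Fractions of the full data set after the two splits described above.
train_frac = 0.8 * 0.8   # used for fitting
val_frac = 0.8 * 0.2     # used for validation, i.e. 16% of the full set
test_frac = 0.2

# Hypothetical training labels: 0 = non-SSI, 1 = SSI.
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

n_tr = y_train.size                  # total number of training labels
n_pos = int(np.sum(y_train == 1))    # number of labels marked class 1

# Weight 1 for class 0 and N_tr / N_{y=1} for class 1.
class_weight = {0: 1.0, 1: n_tr / n_pos}
```

With ten training labels of which two are positive, the SSI class receives a weight of 5, so each positive sample contributes five times as much to the loss as a negative one.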

The early stopping mechanism stores the weights for each epoch, and if the algorithm stops early because the validation metrics fail to improve, the best weights stored by the early stopping mechanism are used. The number of neurons was determined through validation.
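The stopping rule can be illustrated with a framework-independent sketch. It assumes that only the validation loss is monitored and that an epoch counts as an improvement when the loss drops by more than the minimum ∆; the function name and the loss sequence are hypothetical. In Keras, comparable behaviour is configured through the `EarlyStopping` callback with its `min_delta`, `patience`, and `restore_best_weights` arguments.

```python
def early_stopping(val_losses, min_delta=0.01, patience=10, max_epochs=250):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    best_epoch is the epoch whose weights would be restored;
    stopped_epoch is None if training ran through without stopping early.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    stopped_epoch = None
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if best_loss - loss > min_delta:
            # Sufficient improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, stopped_epoch


# Hypothetical loss curve: quick improvement, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.695] * 20
best_epoch, stopped_epoch = early_stopping(losses)
```

In this example the drop from 0.7 to 0.695 is smaller than the minimum ∆, so the plateau epochs do not count as improvements and training stops once the patience of 10 epochs is used up.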

The complete process was run 10 times for each imputation method.

Uwave with missing data

The same parameters used for the SSI data were used on this data set, with the one difference that the labels and data had to be synchronously shuffled before splitting the data, as with the SVM.

Chapter 5

Results and discussion

Histograms of the DTW distances for the SSI data are shown below in section 5.2.1, and histograms of the Euclidean distances in section 5.2.2. The same kinds of DTW and Euclidean distance histograms are shown in sections 5.6.1 and 5.6.2, respectively, for the uwave data set. These plots are shown to explain and justify the value of γ used by the SVM classifier and to capture the differences between the imputation methods, as experienced by the distance-based classifiers.

Informative missingness is discussed in section 5.1. The differences between the histograms of DTW and Euclidean distances are shown in section 5.2.3 for the SSI data and in section 5.6.3 for the uwave data. Then, in sections 5.3 and 5.7, the kNN score metrics for the test set are shown for the SSI and uwave data, respectively, along with plots of the F₁ scores for their validation sets. Thereafter, sections 5.4 (SSI) and 5.8 (uwave data) show tables for the SVM classifier, followed by plots of balanced accuracy and F₁ score. Non-SSI has been assigned the label 0, and SSI the label 1.
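For reference, the two metrics named above can be computed from the confusion counts as in the following sketch, with label 1 (SSI) treated as the positive class. This uses the standard definitions of F₁ score and balanced accuracy; the function name and the example labels are hypothetical, and the source does not state which implementation was used.

```python
def binary_scores(y_true, y_pred):
    """Return (f1, balanced_accuracy) with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    balanced_accuracy = (recall + specificity) / 2
    return f1, balanced_accuracy


# Hypothetical example: 2 SSI patients and 4 non-SSI patients.
f1, bal_acc = binary_scores([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Balanced accuracy averages the recall of the two classes, which makes it less sensitive than plain accuracy to the imbalance between SSI and non-SSI patients.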

Each validation run is independent of the others and starts training from scratch.

5.1 SSI missing data

As mentioned in section 2.3, missing data can be informative; the missing data may be MAR, MNAR, or MCAR, or a mix of MAR and MCAR. Some of this informative missingness can be seen in figure 2.2, which shows, for each blood test type, an image matrix of the data, a bar plot of the Pearson correlation between missing data and incidence of SSI, and the missing rate. As mentioned earlier, the data were collected before gastrointestinal surgery, which is a collective term for surgery on the esophagus, pancreas, liver, gallbladder, and the rest of the digestive system [115].
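The two quantities summarised in the figure, the missing rate and the Pearson correlation between missingness and SSI, can be computed along the following lines. This is a sketch on a tiny hypothetical array for a single blood test, with missing values encoded as NaN; the actual data layout used by the author is not given in the text.

```python
import numpy as np

# Hypothetical measurements for one blood test: rows are patients,
# columns are days; NaN marks a missing measurement.
X = np.array([
    [np.nan, np.nan],
    [np.nan, 4.1],
    [np.nan, np.nan],
    [3.9,    4.0],
    [4.2,    np.nan],
    [4.0,    3.8],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = SSI, 0 = non-SSI

# Binary missingness indicator: 1.0 where the value is missing.
mask = np.isnan(X).astype(float)

# Fraction of patients with a missing value on each day.
missing_rate = mask.mean(axis=0)

# Pearson correlation between missingness on each day and the SSI label.
corr = np.array([np.corrcoef(mask[:, d], y)[0, 1] for d in range(X.shape[1])])
```

In this toy example, missingness on the first day coincides exactly with the SSI label, giving a correlation of 1, while the second day shows only a weak positive correlation. A day on which no value, or every value, is missing has a constant indicator, and the correlation is then undefined.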

One can see in figure 2.2 that the missing rate of all the blood tests increases from start to end, with some variation. Carbamide (c), creatinine (d), glucose (f), and thrombocytes (k) have the most data missing, which can be seen in both the data image matrix and the bar plot with missing rates. Overall, taking blood tests on days 0 and 2 seems common for 6 of the 11 blood tests: albumin (a), CRP (e), glucose (f), hemoglobin (g), potassium (i), and sodium (j). CRP increases on day 1 in a number of patients, and leukocytes increase on day 7 for several. The thrombocyte count seems to increase over time for several patients, but there is no obvious common day on which the increase starts.

One can also see that there is a weak positive correlation between infection and data missing on day 0 for amylase and hemoglobin. This is speculation from the author, but a possible reason for not taking amylase and hemoglobin blood tests on day 0 could be pancreatitis with complications, which is an infection of the pancreas resulting from digestive enzymes consuming the tissue of the pancreas. A common cause of pancreatitis is alcoholism [116]. Alcohol is known to dehydrate the body, which in turn causes hemoglobin levels to decrease due to reduced plasma volume [117], while the infection in the pancreas increases the amylase levels. Common treatments for acute versions of this condition are hydration with intravenous fluids and fasting while resting. To get more accurate test results, a medical practitioner who knows the condition will probably wait to test hemoglobin and amylase levels. As these missing data possibly imply alcoholism, which is a sensitive topic, they may also affect a machine learning algorithm with access to health records used in areas such as social services, insurance cases, court cases, and health treatment, and inadvertently cause discrimination.
