
Using motion data to detect cardiac dysfunctions

A machine learning approach

Ole Kristian Stumpf

Master’s Thesis Spring 2016



May 2, 2016

The Intervention Centre, Oslo University Hospital, Rikshospitalet
Faculty of Medicine, University of Oslo
Oslo, Norway

Department of Informatics
Faculty of Mathematics and Natural Sciences, University of Oslo
Oslo, Norway


Abstract

This thesis investigates the possibility of predicting four different cardiac (heart) dysfunctions using machine learning and cardiac motion data. The intended goal is to integrate a motion sensor into a pacemaker, thus enabling continuous monitoring of cardiac motion during and after cardiac surgery.

Motion data for the four dysfunctions was provided for this thesis. The measurements of the cardiac motion had been collected from 20 pigs at the Intervention Centre at Oslo University Hospital, Oslo, Norway. Each measurement was split into sequences representing one complete heart beat. Six different sets of features were created from the sequences. These feature sets were tested on three different classifiers: decision tree, SVM, and AdaBoost.

Results show that predicting which of the five cardiac functions is present is not feasible using the current dataset. The sparse dataset and the overlap in data make it difficult for the classifiers to discriminate between the various dysfunctions. Predicting one cardiac function from another, however, shows good results.

Some combinations of two cardiac functions are relevant for clinical use because they are often expected either during or after cardiac surgery. These are combinations including baseline and a dysfunction. Baseline versus adrenaline, ischemia, and fluid loading, respectively, all achieved a precision and recall of more than 0.80. These results demonstrate the feasibility of using machine learning and motion data from the heart to predict certain cardiac functions.


Contents

1 Introduction
    1.1 Outline

2 Background
    2.1 Medical background
        2.1.1 The heart
        2.1.2 The cardiac cycle
        2.1.3 Cardiac motion
        2.1.4 Monitoring the heart
    2.2 Thesis background
        2.2.1 The novel technique
        2.2.2 Cardiac dysfunctions
        2.2.3 Past research regarding the novel technique
        2.2.4 Accelerometer
    2.3 Machine learning algorithms
        2.3.1 Machine learning
        2.3.2 Classifiers
    2.4 Features
        2.4.1 Curse of dimensionality
        2.4.2 Feature extraction
        2.4.3 Feature scaling
        2.4.4 Feature selection
    2.5 Model validation
        2.5.1 Cross-validation
        2.5.2 No free lunch theorem
        2.5.3 Hyperparameter optimization
        2.5.4 Overfitting
    2.6 Statistics
        2.6.1 Performance measures
        2.6.2 Significance test

3 Implementation
    3.1 Choice of implementation environment and language
    3.2 Motion data segmentation
        3.2.1 The dataset
        3.2.2 Segmentation
        3.2.3 Achieving a fixed length of the sequences
        3.2.4 Graphs of the motion sequences
    3.3 Feature extraction
        3.3.1 Domain knowledge
        3.3.2 Machine learning knowledge
    3.4 Training and testing
    3.5 Choice of grid search
    3.6 Approach

4 Experiments and results
    4.1 Two-class experiments and results
        4.1.1 Analysis
    4.2 Further improving the best two-class models
        4.2.1 Experiments
        4.2.2 Results
        4.2.3 Analysis
    4.3 Details of three two-class models
        4.3.1 Best performing model
        4.3.2 Model using only three features
        4.3.3 Model using the raw data as features
    4.4 Optimizing certain combinations of two classes
        4.4.1 Experiments
        4.4.2 Results
        4.4.3 Analysis
    4.5 Adding classes to the classification task
        4.5.1 Experiments
        4.5.2 Three classes - results
        4.5.3 Four classes - results
        4.5.4 Five classes - results
        4.5.5 Analysis

5 Discussion
    5.1 General discussion
    5.2 Conclusion
    5.3 Future work

Appendices
    A MATLAB's decision tree split criterions
    B MATLAB's SVM kernels
    C Statistics
        C.1 Feature selection methods
        C.2 P-value


List of Figures

2.1 Diagram of the heart [4] showing the four chambers and the blood vessels connected to the heart.

2.2 Figure showing the difference between diastole and systole [63].

2.3 Figure showing the decomposition of cardiac movement into radial, circumferential and longitudinal axes [3].

2.4 Figure showing the characteristic ECG trace for a cardiac cycle [2, 8].

2.5 Figure showing a heart examination using a transthoracic echocardiogram [49].

2.6 View of the heart showing the 3-axis accelerometer (motion sensor). Arrow indicates the circumferential motion [30].

2.7 Figure showing myocardial ischemia [5].

2.8 Illustration of the left ventricle, showing the basis and apex of the heart (also referred to as basal and apical regions), the LAD, and the location of the occlusion and the motion sensor (accelerometer) [28].

2.9 A conceptual drawing of an accelerometer. It consists of a mass m that can move relative to the accelerometer's housing, a spring k that connects the mass to the housing, and a damper that dissipates energy and keeps the spring-mass system from vibrating forever. The other constants are: the damping constant c, the position of the moving block relative to an inertial (stationary) frame of reference x, and the position of the housing relative to the same inertial frame y [1].

2.10 A machine learning model: a machine learning algorithm/classifier takes labeled feature vectors as inputs (seen in a) and produces a classifier model which can predict the label of new feature vectors (seen in b) [11].

2.11 Figure showing a two-class classification problem. All squares belong to the training set. Each square is a two-dimensional feature vector with values according to x1 and x2. The black and grey colors indicate the label of the feature vector. The classifier has approximated the true function, shown as the solid line (decision boundary). The white triangle represents a new and unseen input that the classifier will predict. Since it is placed above the decision boundary, the classifier will predict that it belongs to the black class.

2.12 An example of a decision tree that, based on the outlook, humidity and windy attributes, classifies a day as belonging to either the P or N class [51].

2.13 Figure showing two iterations of AdaBoost, labeled 1, 2 and 3. In figure 1, a point belonging to the triangle class is misclassified. The point gets a bigger weight, indicated by its bigger size, and is thus correctly classified in figure 2. This leads to a misclassification of a point in the star class. That point gets a bigger weight, resulting in a perfect classification in figure 3.

2.14 Figure showing the optimal margin for a two-dimensional two-class classification problem [16]. The circles and crosses are feature vectors with values according to the x- and y-axis; the circle and cross represent their respective classes. The examples surrounded with a grey square denote the support vectors. The dotted line is the decision boundary/hyperplane created by the SVM.

2.15 Figure showing the transformation, often denoted φ, transforming the data into a linearly separable space. The solid line is the decision boundary, and the circles are data from the training set [6].

2.16 Figure showing leave-one-out cross-validation (LOOCV) used on a dataset with ten examples. The lighter grey example represents the sample used in the testing set while the darker squares are samples in the training set. At each iteration, a new example is chosen as the testing example. After ten iterations, every example has been used once in the testing set and nine times in the training set.

2.17 Figure showing two classifiers and a training set consisting of two classes. The smooth line represents the good classifier, as it doesn't take into account the noisy data points, i.e. the data points that are surrounded by points from the other class. The jagged line represents a classifier subject to overfitting. It tries to maximize the accuracy on the training set, but it will not achieve a good accuracy on new and unseen data points [7].

2.18 Figure showing a classifier predicting cancer. Patients are represented as circles. All patients lying within the circle are predicted as cancer.

2.19 Figure illustrating the two-sample Kolmogorov-Smirnov statistic. Red and blue lines each correspond to an empirical distribution function, and the black arrow denotes the test statistic.

3.1 Figure showing the motion data collected from one animal during a cardiac function over a period of 10 seconds. The timeline is shown on the x-axis. The upper graph shows the acceleration while the next two are the velocity and displacement, marked in the top right corner of each graph. The acceleration, velocity, and displacement are shown as the calculated resultant of all three directions. The bottom graph is the ECG where the R-peak is shown as a red cross. The graph turns green during the first 150 ms after the R-peak, which is also indicated in the motion data. The reason for showing the first 150 ms after the R-peak in green is given in section 3.3.

3.2 Figure showing how a cycle that is represented with more than 300 samples is interpolated and decimated in order to be represented by exactly 300 samples.

3.3 Figure showing the average acceleration of all cycles in the circumferential direction.

3.4 Figure showing the average velocity of all cycles in the circumferential direction.

3.5 Figure showing the average acceleration of all cycles in the longitudinal direction.

3.6 Figure showing the average velocity of all cycles in the longitudinal direction.

3.7 Figure showing the average acceleration of all cycles in the radial direction.

3.8 Figure showing the average velocity of all cycles in the radial direction.

3.9 Figure showing velocity and displacement in the circumferential direction during baseline and ischemia. Thick lines indicate the middle of systole, as approximated by Halvorsen et al. [30].

3.10 Figure showing average peak midsystolic velocity, denoted Vsys. The bottom figure is from an accelerometer mounted in the basal region of the heart (close to the top), but these measurements were discarded. The top figure is from an accelerometer mounted on the apical region (close to the bottom). Vsys is shown during all four interventions (LAD occlusion is ischemia) in addition to baseline (healthy). ± standard deviation is also shown [23].

3.11 Representative left ventricular curves from a single patient of accelerometer circumferential velocity and acceleration during baseline and preconditioning LAD occlusion (ischemia) [28].

3.12 Figure showing the 2-dysfunction classification task, which aims to find the best feature set and classifier. The diamond boxes represent loops, while the produce-result block is specified in figure 3.13. The parallelograms represent an input and output.

3.13 Figure showing the steps in creating a model and achieving a result. Rectangles represent a process that takes an input and produces an output. The dashed rectangle is a process that is sometimes skipped. White ellipses represent a result from a process. The parallelograms represent an input and output.

4.1 Figure showing how many times each sample number from the acceleration in the circumferential direction is chosen to be used as a feature in the two-class classification task.

4.2 Figure showing how many times each sample number from the velocity in the circumferential direction is chosen to be used as a feature in the two-class classification task.

C.1 Figure showing the basic idea of the Relieff feature selection algorithm with k set to one.

List of Tables

2.1 A confusion matrix showing the results of a 3-class classification task. It shows that the classifier got all the instances that belonged to class C1 correct, but misclassified 2 instances of class C2 as C1 and is having difficulties getting the C3 class correct. Naturally, the desirable result is to have low numbers on the off-diagonal.

3.1 Overview of how many subjects there are for each cardiac function. The various cardiac functions are described in section 2.2.2.

3.2 Table showing the number of cycles and the standard deviation for each class.

3.3 Table showing the average number of samples used for each cardiac dysfunction. The results were scaled into the range [0, 1].

3.4 Table showing the hyperparameters used in the grid search for decision tree.

3.5 Table showing the hyperparameters used in the grid search for AdaBoost.

3.6 Table showing the hyperparameters used in the grid search for SVM.

4.1 Table showing the results after following the approach shown in figures 3.12 and 3.13. The three best results are marked with bold text. The various feature sets are specified in section 3.3.1.

4.2 Table showing the hyperparameters used on the most promising models which use AdaBoost as their classifier.

4.3 Table showing the hyperparameters used in the finer grid search for decision tree.

4.4 Table showing the results of the experiments specified in section 4.2. The best result is shown in bold. Experiments five and six are based on the underlined results from experiment four.

4.5 Table specifying the models that will be further investigated.

4.6 Table showing the accuracy and confusion matrix of each class for the best performing model, AdaBoost using all features, achieving an average accuracy of 0.84. In the confusion matrix, columns represent the true class, while rows represent the predicted class.

4.7 Table showing the accuracy and confusion matrix of each class for the second best model, decision tree using all features and Relieff feature selection. In addition, the features chosen by Relieff are shown. The numbers point to the list of features shown in 4.3.

4.8 Table showing the accuracy and confusion matrix of a model using only the raw data as features, in combination with the CFS feature selection method.

4.9 Table showing which models will be created for certain combinations of two classes. Note that a grid search will be done for each of the combinations. All features are summarized in section 4.2.3.

4.10 Table showing accuracy, precision, recall and true negative rate for the clinically most important combinations of classes. The latter class in the leftmost column is used as the positive class in the calculations.

4.11 Table specifying the models used to achieve the results shown in table 4.10.

4.12 Table showing which models will be created when adding classes to the classification task.

4.13 Table showing the details of the best performing models using three classes.

4.14 Table showing the decomposition of the average accuracy for the best model, shown in table 4.13.

4.15 Table showing the details of the best performing models using four classes.

4.16 Table showing the decomposition of the average accuracy for the best model, shown in table 4.15.

4.17 Table showing the details of the best performing models using five classes.

4.18 Table showing the model achieving the best accuracy when including all classes.

4.19 Table showing the model achieving the second best accuracy when including all classes.

B.1 MATLAB's SVM kernels. σ, γ and d are hyperparameters which must be set beforehand, outside of the training process. Hyperparameters are reviewed in section 2.5.3.


Acknowledgment

I would like to express my deepest gratitude to my three supervisors: Ole Jacob Elle, Espen Remme and Jim Tørresen, for the continuous guidance and support. A special thanks to PhD candidate Ole Marius Hoen Rindal from the DSP and Image Analysis Research Group, who voluntarily acted as a fourth, and highly valuable, supervisor.

My sincere thanks to all the fellow students at the Robotics and Intelligent Systems research group for making a great learning environment since the very start of my bachelor studies. Special thanks to Joakim Myrvoll Johansen for the valuable discussions regarding the structure of the thesis.

I would also like to thank my friends and family, and especially my significant other Cecilie N. Michaelsen, for the endless support and her feedback on the thesis.


Chapter 1

Introduction

Recent decades have shown that a vast number of problems can be solved using machine learning. This is a result of an improved understanding of machine learning and an increase in computational resources. When the dimensionality of the data is too big for the human eye, machine learning has often proved useful. Research in the field of motion sensors and machine learning has mostly focused on activity classification: classifying the activity currently performed, such as walking, running and climbing stairs, while wearing a motion sensor. This has been done with promising results [50, 53], which suggests applying the approach to more fine-grained motion.

A novel technique for detecting heart diseases using cardiac (heart) motion has been developed at The Intervention Centre, Oslo University Hospital (OUS), Rikshospitalet, Oslo, Norway. The intended goal is to use a combined miniaturized accelerometer and temporary pacemaker lead attached to the heart during surgery and withdrawn before the patient is discharged.

Such a sensor would permit continuous monitoring of ventricular function during and after cardiac surgery, enabling early detection of complications and treatment guidance [22, 29]. However, the substantial amount of data produced by the sensor makes it difficult to analyze. The goal of this thesis is to use machine learning to create a predictor which can label the heart function of patients.

One of the cardiac dysfunctions is myocardial ischemia, which is one of the most prominent heart diseases. Patients who have undergone cardiac surgery have an increased chance of experiencing ischemia: 2-6% of arterial and 10-20% of venous grafts (transplanted blood vessels) are occluded after one year [32]. Early detection of ischemia is therefore highly desirable.

However, the tools for continuously monitoring and detecting ischemia have room for improvement. The most prominent tool is electrocardiography (ECG), which monitors the electrical activity of the heart. The ECG trace will exhibit a specific abnormal activity when ischemia is present, but this is not always the case [36]. There are also cases where the abnormal trace is exhibited, but ischemia is not present [62]. Another tool is echocardiography, but it requires a skilled technician to interpret the images and is therefore too cumbersome to use as a continuous monitoring technique.

Hence, a sensor that continuously monitors the heart and automatically detects ischemia during and after heart surgery would be beneficial [54].

The machine learning model that will be built makes use of a dataset created at OUS. Several pigs underwent interventions to simulate the dysfunction. Three additional interventions were done to alter the heart function using the same pigs. These include injection of the drugs esmolol and adrenaline, which decrease and increase the heart rate, respectively, as well as an intervention to decrease the stroke volume. A model which can classify these cardiac dysfunctions increases the potential clinical applicability of the accelerometer in cardiac surgery [23]. The intended approach is to test several machine learning classifiers and feature sets. In addition, all two-, three-, and four-class combinations will be tested.
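The combinatorial testing described above can be sketched in a few lines. This is a hypothetical Python outline (the thesis itself uses MATLAB), and the class name `fluid_loading` is an assumption standing in for the stroke-volume intervention:

```python
from itertools import combinations

# The five cardiac functions (classes): baseline plus the four interventions.
classes = ["baseline", "ischemia", "esmolol", "adrenaline", "fluid_loading"]

# Enumerate every 2-, 3- and 4-class subset; each subset defines one
# classification task to be run against all classifier/feature-set pairs.
tasks = [combo for k in (2, 3, 4) for combo in combinations(classes, k)]

print(len(tasks))  # C(5,2) + C(5,3) + C(5,4) = 10 + 10 + 5 = 25 tasks
```

Each of these 25 tasks would then be crossed with the six feature sets and three classifiers in a grid search.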

1.1 Outline

The thesis is divided into four additional chapters: (i) background, (ii) implementation, (iii) experiments and results, and (iv) discussion. The background chapter will elaborate further on the novel monitoring technique. Additionally, it will introduce the various relevant machine learning concepts. The implementation chapter will explain and give the reasoning behind the various choices made regarding the machine learning model.

The experiments and results chapter outlines the experiments conducted and presents the results along with a short analysis where appropriate.

Lastly, the discussion chapter contains a general discussion along with a conclusion of the thesis and suggestions for future work.


Chapter 2

Background

This chapter will attempt to give an introduction to the theory relevant to the thesis. A medical background is presented first, which is intended to give an understanding of the heart and its motion. Next is a part which describes the novel technique and the past research which led to this thesis. The following section, section 2.3, introduces machine learning in general and the three machine learning algorithms used.

2.1 Medical background

2.1.1 The heart

The heart, seen in figure 2.1, is a muscular organ roughly shaped as a cone, with a wide basis and a pointed apex (tip). It is located behind and slightly left of the breastbone, with the apex pointing downwards and to the left. The heart contains four chambers: the upper left and right atria and the lower left and right ventricles. The atria work as blood collection chambers for the ventricles, which expel blood out of the heart. Blood enters the right atrium, which pumps it downwards to the right ventricle. The right ventricle pumps blood to the pulmonary system, i.e., the lungs, which fills the blood with oxygen; the left atrium then collects it. The left atrium pumps it downwards to the left ventricle, which pumps it back out to the circulatory system, i.e., the body [58].

2.1.2 The cardiac cycle

The cardiac cycle refers to a complete heartbeat, from its generation to the beginning of the next beat. A normal resting heart rate for adults ranges from 60 to 100 beats a minute, which implies that a cardiac cycle lasts from 0.6 to 1 second. The cardiac cycle is split into two main phases: the diastole and the systole. The diastole refers to the part of the cycle where the ventricles are relaxed in preparation for refilling with circulating blood, whereas the systole refers to the ejection of blood into an adjacent chamber or vessel. This is seen in figure 2.2. During most of the diastole, blood is passively flowing from the atria to the ventricles. The atria contract at the end of diastole, which propels an additional amount of blood into the ventricles. Systole represents the time in which the myocardium (cardiac muscle) contracts. In the first phase of systole, blood is flowing from the atria to the ventricles while the myocardium contracts. In the second phase of systole, blood is expelled out of the ventricles to the body and lungs while the myocardium still contracts.

Figure 2.1: Diagram of the heart [4] showing the four chambers and the blood vessels connected to the heart.

Figure 2.2: Figure showing the difference between diastole and systole [63].

Several other phases within the cardiac cycle are defined. The two most interesting ones in this project are the mid-systole and the isovolumetric relaxation (IVR) phase; the reason is given in section 3.3. Formally, mid-systole begins after the first of two heart sounds and ends before the second heart sound [52]. Mid-systole is the phase where the valves which keep the blood from flowing towards the lungs and body are open, and blood is ejected through them. The IVR phase occurs at the end of systole, where the ventricular muscle relaxes and decreases its tension without lengthening, so that ventricular volume remains the same [21].
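The quoted cycle lengths follow directly from the relation cycle duration = 60 / heart rate; a trivial sketch of this arithmetic:

```python
def cycle_duration(bpm: float) -> float:
    """Duration of one cardiac cycle in seconds for a given heart rate
    in beats per minute."""
    return 60.0 / bpm

# The resting range of 60-100 beats per minute quoted above:
print(cycle_duration(60))   # 1.0 second per beat
print(cycle_duration(100))  # 0.6 seconds per beat
```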

2.1.3 Cardiac motion

This section aims to provide an overview of how the heart moves during the cardiac cycle. The motion will be decomposed into three axes, shown in figure 2.3, because the motion data at hand is decomposed in the same way. Hence, it can give a better understanding of the data.

Figure 2.3: Figure showing the decomposition of cardiac movement into radial, circumferential and longitudinal axes [3].

Circumferential motion Circumferential motion is the rotation of the heart, that is, the movement upon itself. Usually, the rotation is measured from the left ventricle in the apical region (lower part of the heart) and is given in degrees. The rotation of the left ventricle apical region is about 11 ± 5 degrees in the in situ healthy human heart [54], reaching peak velocities at the beginning of ventricular systole [13]. The rotation is counterclockwise as seen from the apex. By the middle of ventricular systole, the basal and mid-ventricular segments reverse their circumferential motion, reaching peak clockwise velocities by the end of this phase. The apical segments, however, continue their counterclockwise circumferential motion until ventricular repolarization occurs.

Longitudinal motion Regarding movement in the longitudinal direction, there is little apical displacement during systole. All segments move downwards towards the left ventricular apex. Before the end of systole, a fast recoil motion occurs, moving all segments upwards. The movement continues until the blood stops being expelled from the left ventricle.

Radial motion The radial motion on the epicardial surface is reduced by the wall thickening and thinning that occur during the phases when the endocardial surface moves inward and outward, respectively.

Consequently, the dominant motion of the left ventricle epicardial apical region, where the motion sensor is placed, is circumferential displacement or rotation [54].


2.1.4 Monitoring the heart

The following sections will give an introduction to the various techniques used for cardiac monitoring and their weaknesses.

Figure 2.4: Figure showing the characteristic ECG trace for a cardiac cycle [2, 8].

Electrocardiography Electrocardiography (ECG) is the process of monitoring the electrical activity of the heart over time. The monitoring is achieved by placing electrodes on the skin that detect the tiny electrical changes that arise from the depolarization of the heart muscle during each heart beat. The electrical changes are displayed as a graph that shows voltage over time. A healthy heart will show the characteristic ECG tracing where each wave is named, seen in figure 2.4.

A typical ECG tracing is a repeating cycle of three electrical entities: a P wave, which shows the atrial depolarization (contraction), a QRS complex, which shows ventricular depolarization, and finally a T wave, which shows ventricular repolarization (expansion/relaxation). To the trained clinician, an ECG conveys a large amount of information about the structure of the heart and the function of its electrical conduction system. Ischemia causes a decreased blood flow to the muscle tissue of the heart. This causes a depolarization of the resting membrane potential of the ischemic region relative to the resting membrane potential of the normal region, which is manifested as either an elevated or depressed ST segment in the ECG, though not always. There are several drawbacks of using ECG as a detection method for ischemia. These include: (i) the ST depression or elevation is not always present in the ECG trace, (ii) baseline drift, (iii) varying ST-T patterns in the ECG of the same patient, (iv) noise, (v) power line interference, and (vi) patient-dependent abnormal ST depression levels [44].
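For context, chapter 3 uses the ECG R-peak (the tall spike of the QRS complex) to split the motion data into per-beat sequences (see figure 3.1). Below is a deliberately naive, pure-Python illustration of threshold-based R-peak picking on a synthetic trace; it is not the thesis's method, and production detectors (e.g. Pan-Tompkins) add band-pass filtering and refractory-period logic:

```python
import math

def find_r_peaks(ecg, threshold):
    """Naive R-peak detector: return indices that exceed `threshold`
    and are local maxima of the trace."""
    peaks = []
    for i in range(1, len(ecg) - 1):
        if ecg[i] > threshold and ecg[i] >= ecg[i - 1] and ecg[i] > ecg[i + 1]:
            peaks.append(i)
    return peaks

# Synthetic trace: low-amplitude oscillation plus sharp "R" spikes
# every 200 samples (a made-up, perfectly clean signal).
ecg = [0.1 * math.sin(0.1 * i) for i in range(1000)]
for r in range(100, 1000, 200):
    ecg[r] = 1.0

print(find_r_peaks(ecg, 0.5))  # [100, 300, 500, 700, 900]
```

On real, noisy ECG the fixed threshold alone would produce false positives, which is exactly why the drawbacks listed above matter.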

Echocardiography Echocardiography uses high-pitched sound waves sent through a device called a transducer to create images of the heart, seen in figure 2.5. A standard echocardiogram is also known as a transthoracic echocardiogram or cardiac ultrasound. Views of the heart are obtained by moving the transducer to different locations on the chest. It is the most widely used diagnostic technique in cardiology as it can provide helpful information about the heart regarding its size, shape and pumping capacity, and the location and extent of tissue damage. It can also provide an accurate assessment of myocardial ischemia [39]. However, it requires a skilled technician to operate the transducer and interpret the images and is therefore too cumbersome to use for continuous monitoring [28].

Figure 2.5: Figure showing a heart examination using a transthoracic echocardiogram [49].

2.2 Thesis background

Having introduced the most relevant medical theory, the past research leading to this project can be presented.

First, the novel technique for monitoring the motion of the heart is presented. The motion data produced by this technique is used in this project. Then, the various cardiac functions that the technique is intended to monitor are presented. Lastly, a summary of the preceding studies regarding the technique is given.

2.2.1 The novel technique

This thesis is based on a novel technique for automatic detection of various cardiac dysfunctions, developed at The Intervention Centre, Oslo University Hospital, Rikshospitalet, Oslo, Norway. The technique uses a motion sensor attached to the heart, seen in figure 2.6. The motion sensor is a 3-axis accelerometer, measuring acceleration in the three directions specified in figure 2.3.

Figure 2.6: View of the heart showing the 3-axis accelerometer (motion sensor). Arrow indicates the circumferential motion [30].

Several pigs underwent interventions to simulate the various cardiac dysfunctions the technique was intended to monitor. During the interventions, measurements from the motion sensor were collected into a dataset.

It is this dataset that this project will use to build the machine learning model. The reason for attaching a motion sensor is that myocardial (heart muscle) dysfunction is associated with motion abnormalities [64, 66]. A miniaturized motion sensor attached to the cardiac wall may continuously monitor myocardial motion and automatically detect abnormalities. The proposed use of the technique is during and after cardiac surgery [22], as patients undergoing cardiac surgery may experience myocardial dysfunction or ischemia during surgery or in the first few days following the operation [15]. Additionally, the present alternatives, ECG and echocardiography, have the shortcomings mentioned above.

2.2.2 Cardiac dysfunctions

The following paragraphs introduce the four different cardiac dysfunctions that the model will try to classify from each other and from baseline (healthy). Consequently, they will also be denoted as classes. The paragraphs explain the consequences of each dysfunction as well as how it can effectively be inflicted on a subject. Note that these functions cannot appear simultaneously.


Figure 2.7: Figure showing myocardial ischemia [5].

Ischemia/Occlusion Ischemia is the term for what happens when the myocardium (heart muscle) does not receive enough blood, usually caused by a narrowing or blockage of one or more of the blood vessels supplying blood to the myocardium, as seen in figure 2.7. When this happens, the muscle cells which do not receive enough blood start to hibernate. When hibernating, the muscle cells remain viable, but contraction is chronically depressed because the cells try to reduce energy consumption, leading to a change in motion. But if the blood flow to the hibernating muscle cells is too low, the muscle cells will die and cannot be recovered. Common symptoms are chest pain or discomfort which may travel into the shoulder, arm, back, neck or jaw. The pain tends to get worse with activity and go away with rest. Other symptoms are shortness of breath, sweating and a feeling of nausea. The most severe complication is a heart attack, which happens if the blood vessel becomes completely blocked. The part of the heart muscle dependent on blood going through the obstructed blood vessel will die. A heart attack can be fatal if emergency treatment is not received immediately [14].

Myocardial ischemia is the most common cause of death in most Western countries and a major cause of hospital admissions [9]. It might be treated with medications that improve blood flow to the heart muscle. If the patient does not respond to the medications, surgery is needed. The most common surgical procedures are (i) angioplasty and stenting and (ii) coronary artery bypass surgery. The former procedure forces the vessel open by inserting a stent while the latter creates a graft around the narrowed area by using a vessel from another body part.

Ischemia is inflicted onto the animals by using a vascular occluder nearby a blood vessel. The vascular occluder can be thought of as a pinch that hinders the blood flowing through the vessel. The blood vessel occluded is the left anterior descending artery (LAD) as seen in figure 2.8. The LAD and its branches supply the left ventricle with blood, and it is the most commonly


occluded for research purposes. The left ventricle is the thickest of the heart's chambers and is responsible for pumping oxygen-rich blood to tissues all over the body.

Figure 2.8: Illustration of the left ventricle, showing the basis and apex of the heart (also referred to as basal and apical regions), the LAD and the location of the occlusion and the motion sensor (accelerometer) [28]

Low contractility/Esmolol Myocardial contractility refers to the ability of the myocardium to contract. An injection of the drug esmolol will cause a decrease in the force and rate of heart contraction [35]. It is commonly used in patients during surgery to prevent the heart rate from increasing past normal rate, a condition known as tachycardia. Injecting too much esmolol can cause the heart to become ischemic.

High contractility/Adrenaline Epinephrine, also known as adrenaline, is a hormone that can be used as a stimulant in cardiac arrest. Cardiac arrest is a sudden stop in effective blood circulation due to a stop in myocardial contraction. Epinephrine will increase blood flow as well as increase the heart rate, muscle strength and blood pressure [38]. Epinephrine is also used to treat anaphylactic shock, a severe allergic reaction that can leave the heart unable to pump enough blood. The details of the infliction of both esmolol and adrenaline onto the animals can be found in [22].

Fluid loading The intervention regarding fluid loading will enhance preload [47]. Preload is the measure of the initial stretching of the cardiac muscle cells at the end of diastole. Increasing preload will cause an increase in stroke volume while decreasing preload will cause a decrease in stroke



volume.¹ To determine how the accelerometer measurements responded to an elevation in volume loading (preload) of the heart, rapid intravenous infusion of body-tempered isotonic saline² was performed.

2.2.3 Past research regarding the novel technique

A brief overview of the research regarding the novel technique will be given here.³ Elle et al. [19] describe the first study of how accelerometers can be used to recognize cardiac ischemia. An accelerometer was sutured on the left ventricle in the region of blood supply of the left anterior descending coronary artery (LAD) on three different pigs. The LAD was totally occluded, and it was shown that the positive to negative peak systolic acceleration was reduced by approximately 40% from baseline to occlusion. A method to detect the early changes in the measured acceleration just after occlusion and during incipient ischemia was proposed, using a difference value based on the FFT frequency pattern. The next study, by Halvorsen et al. [28], concluded that “...accelerometer detects myocardial ischemia with great accuracy. This novel technique has potential to improve monitoring of myocardial ischemia during cardiac surgery”. Further development was made by Grymyr et al. [23]. A method for automatically calculating a 3D velocity vector from accelerations in circumferential, longitudinal and radial directions was made. The method facilitates insertion of the accelerometer as no precise alignment is required. The weakness of these studies is that the data is manually analyzed. When a substantial amount of data is collected, creating a rule-based system for prediction is difficult and poses a risk of missing out on valuable pieces of data. Machine learning alleviates this weakness: instead of following rule-based models, it automatically identifies discriminative values in the measurements to predict the outcome of new motion data.

There has been a limited amount of research using machine learning and a motion sensor attached to the heart to detect various cardiac dysfunctions.

2.2.4 Accelerometer

The motion sensor used by the described technique is an accelerometer.

An accelerometer is a device that measures proper acceleration, meaning the physical acceleration experienced by an object. A conceptual drawing is shown in figure 2.9.

¹ The volume of blood pumped from the left ventricle per beat.

² Sterile sodium chloride solution at physiological concentration (0.87%); used as a wound cleansing agent or irrigating fluid.

³ A detailed description of the technique can be found in the following research papers (sorted by date of publication): [19, 22, 23, 28–30, 54]


Figure 2.9: A conceptual drawing of an accelerometer. It consists of a mass m that can move relative to the accelerometer's housing, a spring k that connects the mass to the housing, and a damper that dissipates energy and keeps the spring-mass system from vibrating forever. The other constants are: the damping constant c, the position of the moving block relative to an inertial (stationary) frame of reference x, and the position of the housing relative to the same inertial frame y [1].

One can think of an accelerometer as a device that behaves like a damped mass attached at the end of a spring. When the device experiences an acceleration, the mass is displaced to the point that the spring is able to accelerate the mass at the same rate as the casing. The displacement is then measured to give the acceleration. Most studies using an accelerometer and machine learning have been in the field of activity detection and fall detection. Ravi et al. [53] use an accelerometer worn near the pelvic region and manage to detect different activities such as standing, walking and running with high accuracy. The measurements from the accelerometer were used to calculate the average, standard deviation, energy and correlation, which in turn were used to classify the activities.
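As an illustration, the per-window statistics named above (average, standard deviation, energy and axis correlation) can be sketched as follows. The function name and the choice of a single correlation pair are illustrative assumptions, not taken from [53]:

```python
import math

def window_features(ax, ay, az):
    """Per-window features of a 3-axis accelerometer signal:
    mean, standard deviation, energy, and one axis-pair correlation."""
    feats = {}
    for name, sig in {"x": ax, "y": ay, "z": az}.items():
        n = len(sig)
        mean = sum(sig) / n
        var = sum((v - mean) ** 2 for v in sig) / n
        feats[f"mean_{name}"] = mean
        feats[f"std_{name}"] = math.sqrt(var)
        # Energy: mean of squared samples (time-domain form)
        feats[f"energy_{name}"] = sum(v * v for v in sig) / n

    def corr(a, b):
        # Pearson correlation between two axes
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
        sa = math.sqrt(sum((u - ma) ** 2 for u in a) / n)
        sb = math.sqrt(sum((v - mb) ** 2 for v in b) / n)
        return cov / (sa * sb)

    feats["corr_xy"] = corr(ax, ay)
    return feats
```

Each window of raw samples is thus reduced to a small, fixed-length feature vector that a classifier can consume.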

Albert et al. used the embedded smartphone accelerometer and a machine learning model to detect if the subject had fallen, as well as to classify the type of fall, where four types of fall were defined. The model achieved 98% accuracy in detecting falls and 99% accuracy in detecting the type of fall. These results demonstrate the potential of accelerometers and suggest that they could be used to classify more fine-grained motions, such as cardiac function. No past research using an accelerometer to predict cardiac function, except for the studies mentioned in the previous paragraph, was found.

2.3 Machine learning algorithms

The remainder of this chapter will introduce machine learning and its various concepts relevant to this thesis. This section will introduce machine learning in general and briefly present three different



machine learning algorithms.

2.3.1 Machine learning

Figure 2.10 shows the general procedure of implementing and using a machine learning model to predict the true label of new and unseen data.

Figure 2.10: A machine learning model consists of a machine learning algorithm/classifier that takes labeled feature vectors as inputs (seen in a) and produces a classifier model which can predict the label of new feature vectors (seen in b) [11].

Machine learning is the process of building a model from a dataset in order to make data-driven predictions or decisions. The dataset consists of pairs, seen as training examples. Each pair consists of an input/feature vector and the corresponding output, also referred to as the class. The training process consists of analyzing the pairs and producing an inferred function, which then can be used for mapping new and unseen data to a class. More precisely, given a training set of N example input-output pairs

(x1, y1), (x2, y2), ..., (xN, yN),

where yj was generated by an unknown function y = f(x), the training/learning task is to discover a function h that approximates the true function f; h is also denoted the decision boundary. As the number of input-output pairs in the dataset grows, the difference between h and the true function f decreases; hence, we say that the model is learning the true function from the input. We say that h is a good approximation of f if the model generalizes well, that is, if the model often predicts the correct label for new and unseen inputs. The learning process inherent in any machine learning model is what separates machine learning from an algorithm that follows strictly static program instructions.

Figure 2.11 shows an example of a training set, a machine learning algorithm's approximation h of the true function f, and the prediction of a new and unseen input.


Figure 2.11: Figure showing a two-class classification problem. All squares belong to the training set. Each square is a two-dimensional feature vector with values according to x1 and x2. The black and grey colors indicate the label of the feature vector. The classifier has approximated the true function, shown as the solid line (decision boundary). The white triangle represents a new and unseen input that the classifier will predict. Since it is placed above the decision boundary, the classifier will predict that it belongs to the black class.

2.3.2 Classifiers

The classifier, or the machine learning algorithm, as seen in figure 2.10, is the component which produces the inferred function h. There are a vast number of classifiers, and the next three paragraphs summarize the classifiers intended to be used in this thesis, their strengths and weaknesses, and why they have been chosen.

Decision trees The decision tree classifier is one of the simplest classifiers, hence the first to be introduced. It was first presented by JR Quinlan in 1986 [51]. The learning phase involves constructing a tree consisting of nodes and edges using the dataset. A node represents a test on one of the features, and the edges represent an outcome of the test. Leaf nodes represent the outcome class. Predicting then consists of following a path from the root node to a leaf node. An example of a decision tree is seen in figure 2.12.

An important aspect of decision trees is how to create one, as different trees give rise to different classifiers. MATLAB's (used in this project) method for creating classification trees is known as top-down induction of decision trees (TDIDT) and is an example of a greedy algorithm. It works recursively in the following manner:

1. Start with all input data and examines all possible binary splits on every feature.

2. Select a split with the best optimization criterion. Information about



Figure 2.12: An example of a decision tree that, based on the outlook, humidity and windy attributes, classifies a day as belonging to either the P or N class [51].

the split criterion is given in appendix A.

3. Impose the split

4. Repeat recursively for the two child nodes.
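Steps 1 and 2 above can be sketched as follows; this is a minimal illustration using Gini impurity as the optimization criterion, which is only one of the split criteria available (see appendix A for the criteria actually used):

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Exhaustively examine all binary splits on every feature and
    return the (feature index, threshold) pair that minimises the
    weighted Gini impurity of the two resulting child nodes."""
    n, d = len(X), len(X[0])
    best = (None, None, float("inf"))
    for j in range(d):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]
```

Repeating this search on each child node until the leaves are pure (or another stopping rule fires) yields the full tree.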

There are several reasons for the choice of testing decision trees in this project. They are conceptually simple and deal well with noise [43] (data that is not representative of the true function). A decision tree is a white box model, meaning that the decisions that led to a specific prediction can be seen. This is desirable if knowledge about the dataset is sought. One of the disadvantages stems from its greedy strategy. Finding the right root of the tree can greatly impact the final result, i.e., small changes early on can have big effects later. Furthermore, if the true underlying function is truly linear, decision trees usually do not work well. Lastly, decision trees do not function well if there are large numbers of uncorrelated features, since they work by finding the interactions between variables.

Ensemble learning Ensemble learning is the process of combining multiple classifiers into one to obtain better performance. The assumptions behind combining classifiers are: (i) if every classifier is wrong in one unique subset of the dataset, the model will still predict correctly since the rest of the classifiers are correct on the same subset, and (ii) this holds for every subset. The question arises of how to combine the classifiers. The most widely used way of combining classifiers is through a technique called boosting. In 1989, Robert E. Schapire introduced the first boosting algorithm [59], in the paper “The strength of weak learnability”.

The general idea is to collect a set of poor classifiers, where each classifier performs slightly better than chance. By combining them, they will perform arbitrarily well. An improvement of boosting was introduced by Yoav Freund and the same Schapire in a paper from 1999 [20], where AdaBoost (Adaptive Boosting) was presented. The fundamental difference


was to assign weights to each feature vector according to how difficult previous classifiers found it to classify correctly. In short, AdaBoost works iteratively until some number of iterations is reached or the error rate falls below some threshold. At each iteration, a new classifier is trained on the weighted training set. Initially, all weights are set equally, but in each round, the weights of incorrectly classified feature vectors are increased. The next classifier will then have to focus on getting these feature vectors correct. The change of weights is usually set to decrease by a fixed factor as the number of iterations grows. This is because the uncertainty increases with the number of iterations. This fixed factor is known as the learning rate. The general idea can be seen in figure 2.13.

Figure 2.13: Figure showing two iterations of AdaBoost, labeled 1, 2 and 3. In figure 1, a point belonging to the triangle class is misclassified. The point gets a bigger weight, indicated by its bigger size, and is thus correctly classified in figure 2. This leads to a misclassification of a point in the star class. That point gets a bigger weight, resulting in a perfect classification in figure 3.
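The weight update at the heart of AdaBoost can be sketched as follows; this is a minimal illustration of one boosting round in the standard formulation, with the helper name illustrative and the learning rate omitted for brevity:

```python
import math

def adaboost_round(weights, predictions, truth):
    """One AdaBoost round: compute the weak learner's weighted error,
    its vote alpha, and the re-weighted (normalised) training weights.
    Labels are assumed to be +1/-1."""
    err = sum(w for w, p, t in zip(weights, predictions, truth) if p != t)
    err = max(min(err, 1 - 1e-10), 1e-10)      # numerical guard
    alpha = 0.5 * math.log((1 - err) / err)    # the weak learner's vote
    # Increase weights of misclassified examples, decrease the rest
    new = [w * math.exp(alpha if p != t else -alpha)
           for w, p, t in zip(weights, predictions, truth)]
    z = sum(new)
    return alpha, [w / z for w in new]
```

Examples the weak learner got wrong end up with larger normalised weights, so the next weak learner concentrates on them; the final prediction is the alpha-weighted vote of all rounds.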

AdaBoost is proven to work on small datasets, which is the main reason it is chosen to be tested [45]. AdaBoost uses the mentioned decision trees as its weak learners. In practice, the trees will favor splitting feature vectors with high weights. Consequently, noisy feature vectors will be given higher weights. Hence, AdaBoost will not work well when the number of outliers is considered high. Additionally, training can be time-consuming when the dataset is large, and convergence can be slow for complex problems [60]. This weakness is alleviated since the dataset at hand is relatively sparse.

Support vector machines (SVM) The original SVM algorithm was first presented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963, while the standard used today was introduced by Corinna Cortes and Vladimir Vapnik in 1995 [16]. It is considered to be a complex classifier. Hence, the presentation of it will be limited to the general idea.



Figure 2.14: Figure showing the optimal margin for a two-dimensional two-class classification problem [16]. The circles and crosses are feature vectors with values according to the x- and y-axis. The circle and cross shapes represent the class each example belongs to. The examples surrounded by a grey square denote the support vectors. The dotted line is the decision boundary/hyperplane created by the SVM.

Consider the 2-class classification problem in figure 2.14 where the classes are represented as crosses and circles and the approximation h, i.e. the decision boundary, is represented as a dotted line. The prediction of new data consists of investigating which side of the decision boundary it lies on.

If it lies above the dotted line, it is classified as belonging to the cross class, and it is classified as the circle class if it lies below. If the predicted data lies close to the decision boundary, one can see that if the decision boundary were moved only a small amount, there would be a risk of misclassifying that data.

SVMs, however, take this into account and try to find the best decision boundary. Formally, the best decision boundary is the one that lies furthest away from any of the training examples. That decision boundary is said to be maximizing the margin. Hence, the training process consists of creating a decision boundary which maximizes the margin. Cortes et al. [16] showed that the process of maximizing the margin with respect to the input vectors only depends on the dot product of them. This is a key observation, as an apparent weakness of SVMs is that they create a straight line between the classes in the 2-dimensional case or a hyperplane in the high-dimensional case. Consequently, SVMs are apparently only feasible if the classes are linearly separable, as in the problem in figure 2.14. In order to overcome this issue, one can transform the data to a higher-dimensional space where the data is linearly separable, as seen in figure 2.15. The decision boundary was dependent only on the dot product of the feature vectors. This means that the transformation itself is not needed, but only a function that implicitly computes the high-dimensional dot product. This function is referred to as a kernel function and is a vital component of the


Figure 2.15: Figure showing the transformation, often denoted φ, transforming the data into a linearly separable space. The solid line is the decision boundary, and the circles are data from the training set [6].

classifier. The linear, polynomial and Gaussian kernels are provided by MATLAB. A more detailed description of them is given in appendix B. The choice of which kernel to use depends on the dataset, but the RBF (Gaussian) kernel may be the most used kernel [12]; hence it is the only kernel to be tested in this project. This is done to ease the computational complexity.
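As a concrete illustration, the Gaussian (RBF) kernel can be written directly; it implicitly computes a dot product in an infinite-dimensional feature space without ever performing the mapping φ. The gamma parameter, controlling the width of the Gaussian, is a hyperparameter:

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2).
    Returns 1.0 for identical vectors and decays toward 0
    as the vectors move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

The SVM only ever evaluates such kernel values between pairs of feature vectors, which is why the explicit high-dimensional transformation is never needed.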

The reason for including SVM as one of the three classifiers is that it often provides significantly better classification performance than other classifiers on reasonably sized datasets [45]. It has strong theoretical foundations and excellent empirical successes [37]. Furthermore, SVM is very robust against ill-behaved distributions, high dimensionalities and a low ratio of training examples to the dimensionality of the input data [10]. One drawback is the lack of transparency in results, as the computation might be done in a very high-dimensional space. SVM is also computationally expensive, and thus runs slowly for large datasets [40]. As with AdaBoost, this weakness is alleviated since the dataset at hand is relatively sparse.

2.4 Features

Choice of features, the components of a feature vector, is crucial when building a model. They are the inputs to the model and are used to discriminate between the classes. For this reason, they need to be informative about the classes. For instance, if the model's task is to detect spam mail, examples of informative features could be the frequency of certain terms and the grammatical correctness of the text. The following paragraphs elaborate on the choice of features.

2.4.1 Curse of dimensionality

The curse of dimensionality refers to the problem that arises with high-dimensional input vectors: more data is needed to enable the model to generalize adequately. When the dimensionality of the input



vector increases, the complexity of the underlying pattern might increase. Regardless, the model needs more examples to uncover the underlying pattern, which is not always possible, as in this project.

2.4.2 Feature extraction

Feature extraction is the process of computing the feature vectors that are used as input to the classifier. This is also the “feature extractor” block in figure 2.10. The features extracted should be values that separate the classes well. Informally, a good set of features will make it easy for the classifier to discriminate between the various classes, but finding these features is domain specific. The choice of features in this project is elaborated and reasoned about in the next chapter, in section 3.3.

2.4.3 Feature scaling

Different values in a feature vector often take on different ranges, which is not desirable. Consider two features, one ranging from 1 to 10 and the other from 1 to 1000. If the classifier uses the distance between the data points as a way of classifying, it will be governed by the latter feature, as the differences in the former are always small in comparison. Ensuring standardized feature values implicitly weights all features equally in their representation.
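A common scaling scheme is standardization to zero mean and unit variance; a minimal sketch (the function name is illustrative, and a constant column with zero variance would need a guard):

```python
import math

def standardise(column):
    """Scale one feature column to zero mean and unit variance so
    that no single feature dominates distance-based classifiers."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]
```

The scaling parameters (mean and standard deviation) should be computed on the training set only and then reused to scale the test set, so that no information leaks from test to training data.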

2.4.4 Feature selection

Feature selection is the process of selecting a subset of features that are relevant. There are several benefits of doing feature selection [25]: (i) increasing classifier performance, (ii) reducing storage requirements, (iii) reducing training time, and (iv) defying the curse of dimensionality. Feature selection methods are usually divided into three categories: filters, wrapper methods and embedded methods. Filters do not involve the classifier but analyze intrinsic properties of the data to find useful features. Wrappers, on the other hand, use a classifier and tend to perform better, but are also more computationally inefficient for large datasets [31]. Additionally, wrappers tend to overfit when the number of examples is insufficient. Embedded methods are essentially a classifier with an embedded feature selection. Selecting features is therefore an inherent part of the learning process. They tend to be more computationally efficient compared with wrappers [57]. Note that a feature that is completely useless by itself can provide a significant performance improvement when used together with other features [25].

Two feature selection methods will be tried and tested, one filter method and one wrapper method. The filter method is Correlation-based Feature Selection (CFS), and the wrapper method is the Relieff algorithm. An introduction and their strengths and weaknesses are given in appendix C.1.


Lastly, it is important to note that the no free lunch theorem described in section 2.5.2 also holds for feature selection. Consequently, there is no optimal feature selection method for all problems. More than one method will therefore be used. Guo et al. [24] reported that Relieff and CFS performed consistently better than seven other feature selection methods. Additionally, using a filter and a wrapper ensures that two conceptually different methods are tried and tested.

2.5 Model validation

This section intends to introduce concepts regarding the testing of machine learning models. The first paragraph describes how a machine learning model is commonly tested. The next paragraph states why more than one classifier is tested in this project. Next, an approach to finding the best parameters for a classifier is described. The final paragraph, 2.5.4, presents the concept of overfitting, which needs to be addressed when training and testing a machine learning model.

2.5.1 Cross-validation

Techniques must be used to assess the quality of the model. A common technique is to split the dataset into a training set and a test set and use a model validation technique known as cross-validation. The model is built using the training set and its quality assessed using the test set. The test set is assumed to approximate the typical unseen data that a model will encounter. The weakness is that this approach only works for large datasets, and not the dataset in this project. If the dataset is not large enough, the two sets will not represent the true function properly, since they become more sensitive to noise. A common workaround is known as k-fold cross-validation. Here, the dataset is randomly partitioned into k subsets, where one subset is used as the test set and the rest as the training set. The model is trained using the training set and then assessed using the test set. A new partition is then selected as the test set, and the old test set becomes part of the training set. This process is repeated until all partitions have been used as a test set. The results can be averaged to produce a single error estimate. This procedure has the advantage that all examples in the dataset are eventually used for both training and testing. If k is set to the number of feature vectors in the training set, it is known as leave-one-out cross-validation (LOOCV), shown in figure 2.16.



Figure 2.16: Figure showing leave-one-out cross-validation (LOOCV) used on a dataset with ten examples. The lighter grey square represents the sample used in the testing set, while the darker squares are samples in the training set. At each iteration, a new example is chosen as the testing example. When ten iterations have been done, each example has been used once in the testing set and nine times in the training set.

Here, exactly one feature vector is used as the test set and the rest as the training set. LOOCV can be an alternative when the dataset is small since the focus should be on training on as many examples as possible at once.
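The LOOCV procedure can be sketched as a split generator (the helper name is illustrative):

```python
def loocv_splits(n):
    """Generate leave-one-out splits over n examples: each example
    is the test set exactly once; all others form the training set."""
    for i in range(n):
        test = [i]
        train = [j for j in range(n) if j != i]
        yield train, test
```

A model would be trained and evaluated once per yielded split, and the n error estimates averaged into a single score.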

2.5.2 No free lunch theorem

The no free lunch theorem [67] states that there is no classifier that works best for every problem as the assumptions of a great model for one problem may not hold for another problem. Therefore, it is common to build several models and choose the best model for the specified problem.

2.5.3 Hyperparameter optimization

Every classifier needs some hyperparameters to be set. In machine learning, hyperparameters are values that must be specified outside the training procedure. The setting of a hyperparameter can influence the accuracy of the trained model. Optimal hyperparameter settings often differ between datasets, and it is difficult to know beforehand which hyperparameters are the best. Hsu et al. [34] recommend what is known as grid search as a strategy for finding the best parameters. Grid search is simply an exhaustive search through a manually specified subset of the parameter space of a classifier. Grid search is the safest way to find the best hyperparameters, in contrast to search by approximations or heuristics, which may get stuck in local optima. The biggest weakness of grid search is its computational complexity. The problem quickly becomes intractable when the number of hyperparameters increases, although there are ways to overcome this weakness. Not all hyperparameters are dependent, which means that the search can be parallelized. Additionally, Hsu et al. [34] found that trying exponentially growing sequences of the hyperparameters of an SVM is a practical method to identify good


parameters, since it will test a wider range of hyperparameters. When doing so, one risks quickly skipping past the best region of the grid. To avoid this, one can proceed with an additional, finer grid search on the most promising region of the grid.
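A coarse exponential grid of the kind recommended by Hsu et al. can be sketched as follows; the exponent ranges and the pairing of C with gamma are illustrative assumptions:

```python
import itertools

def exponential_grid(exponents):
    """Candidate (C, gamma) pairs on an exponentially growing grid,
    i.e. powers of two, for an SVM hyperparameter search."""
    values = [2 ** e for e in exponents]
    return list(itertools.product(values, values))

# Coarse pass over a wide range; a finer grid would then be placed
# around the best-scoring (C, gamma) region found by cross-validation.
coarse = exponential_grid(range(-5, 16, 2))
```

Because the candidate settings are independent of each other, each (C, gamma) pair can be evaluated in parallel.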

2.5.4 Overfitting

Overfitting occurs if there is not enough training data or if the learning process has resulted in approximating the noise in the data, as opposed to the underlying true function. If there is not enough training data, the dataset will not capture the underlying function; hence, the classifier will not be able to generalize to new data. If the learning process is done too extensively, it might lead the classifier to learn the inherent noise in the training set instead of the true underlying function. This is seen in figure 2.17. Noise has no causal relation to the true function. Consequently, the model will not be able to generalize well. As noise is inherent in measuring any real-world process, overfitting must always be taken into consideration. The cross-validation techniques average the performance over all training and test sets, and hence assess the presence of overfitting.

Figure 2.17: Figure showing two classifiers and a training set consisting of two classes. The smooth line represents the good classifier, as it does not take into account the noisy data points, i.e. the data points that are surrounded by points from the other class. The jagged line represents a classifier subject to overfitting. It tries to maximize the accuracy on the training set, but it will not achieve a good accuracy on new and unseen data points [7].

2.6 Statistics

This section is divided into two subsections. The former subsection introduces the various performance measures used, while the latter subsection introduces the test statistic used.



2.6.1 Performance measures

Accuracy Accuracy can be both a two-class and a multiclass performance measure. Accuracy is the number of correct predictions divided by the total number of predictions. One needs to take care if one class is by far the most represented one in the dataset: classifying every example as this class will still yield a high accuracy. In this dataset, however, the number of examples in each class is fairly balanced, with a standard deviation of 2.2.

2-class performance measures Precision⁴ and recall⁵ are common 2-class performance measures:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Some research papers also include the specificity⁶ measure:

Specificity = TN / (TN + FP)

The four different outcomes of a binary classifier are true positive (TP), false positive (FP), true negative (TN) and false negative (FN). True and false refer to whether the classifier was correct or not in its prediction, and positive and negative are general names for the two outcome classes. A true positive is therefore the correct (true) classification of an object that belonged to the positive class. A machine learning model which predicts cancer or not cancer is used as an example of precision and recall, seen in figure 2.18. Patients predicted to have cancer lie inside the circle. Precision is the fraction of patients predicted to have cancer who also had cancer. Recall is the fraction of patients having cancer who were also predicted as such. Specificity is similar to recall, but with respect to the not-cancer class.

Different problems may wish to focus on achieving either a high precision or a high recall. When predicting a disease, as in the example mentioned above and in this project, recall can be viewed as more important than precision. As long as further assessment is done before treatment, it is important to recognize as many patients as possible as having the disease, even though it might come at the cost of assessing patients who were not diseased. Precision and recall are further addressed in the chapter discussing the results.
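All three measures follow directly from the four outcome counts; a minimal sketch:

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall and specificity from the four outcomes
    of a binary classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return precision, recall, specificity
```

For instance, a classifier with 9 true positives, 1 false positive, 8 true negatives and 2 false negatives has a precision of 0.9 and a recall of 9/11 ≈ 0.82.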

Multiclass performance measures In the multiclass case, it is interesting to see if some classes were easier for the classifier to predict than others. Confusion matrices will yield such information. A confusion matrix is a

⁴ Also referred to as positive predictive value.

⁵ Also referred to as sensitivity or true positive rate.

⁶ Also referred to as true negative rate.


Figure 2.18: Figure showing a classifier predicting cancer. Patients are represented as circles. All patients lying within the circle are predicted as having cancer.

square matrix with one row for each predicted class and one column for each true class (or vice versa). Table 2.1 shows an example of a confusion matrix.

                Actual C1   Actual C2   Actual C3
Predicted C1        10           2           0
Predicted C2         2           8           0
Predicted C3         4           4           4

Table 2.1: A confusion matrix showing the results of a 3-class classification task. The classifier handles classes C1 and C2 fairly well, although it misclassified 2 instances of class C2 as C1 (and vice versa), and it has difficulties with class C3: only 4 of the 12 instances predicted as C3 actually belong to that class. Naturally, the desirable result is to have low numbers off the diagonal.
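Per-class precision and recall can be read directly off such a matrix: divide each diagonal entry by its row sum (everything predicted as that class) or its column sum (everything actually of that class). A small sketch using the numbers from Table 2.1 (Python, for illustration only; the thesis implementation uses MATLAB):

```python
# Confusion matrix of Table 2.1: rows are predicted classes,
# columns are actual classes.
cm = [
    [10, 2, 0],  # predicted C1
    [ 2, 8, 0],  # predicted C2
    [ 4, 4, 4],  # predicted C3
]
n = len(cm)

# Precision: diagonal / row sum. Recall: diagonal / column sum.
precisions = [cm[k][k] / sum(cm[k]) for k in range(n)]
recalls = [cm[k][k] / sum(cm[i][k] for i in range(n)) for k in range(n)]

for k in range(n):
    print(f"C{k + 1}: precision={precisions[k]:.2f}, recall={recalls[k]:.2f}")
# C3 has perfect recall (all 4 actual C3 instances were found) but low
# precision: only 4 of the 12 instances predicted as C3 are truly C3.
```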

2.6.2 Significance test

Two-sample Kolmogorov-Smirnov test The two-sample Kolmogorov-Smirnov test (KS-test) tests whether two datasets differ significantly. It has the advantage of making no assumption about the distribution of the data; that is, it is non-parametric and distribution free. Other tests, like the Student's t-test, require that the data be approximately normally distributed, which in turn makes them more sensitive than the KS-test when that assumption holds. Specifically, the KS-test statistic is

D_{n,n'} = \sup_x \left| F_{1,n}(x) - F_{2,n'}(x) \right|    (2.1)

where F_{1,n} and F_{2,n'} are the empirical distribution functions, that is, the distribution functions associated with the empirical measure of the data, and where n and n' are the sample sizes of the two datasets respectively.


Regarding the null hypothesis (see appendix C.2 for an overview of the null hypothesis and p-value), it is rejected at level α if

D_{n,n'} > c(\alpha) \sqrt{\frac{n + n'}{n n'}}    (2.2)

where c(α) is a known critical value depending on α. Informally, the KS-test computes the cumulative distribution function (CDF) of each of the two datasets and finds their maximum CDF difference, denoted D. This is seen in figure 2.19.

Figure 2.19: Figure illustrating the two-sample Kolmogorov-Smirnov statistic. Red and blue lines each correspond to an empirical distribution function, and the black arrow denotes the test statistic.
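A minimal sketch of the two-sample KS statistic, computing D directly from the two empirical CDFs and cross-checking against SciPy's ks_2samp (the toy data and the use of Python/SciPy are assumptions for illustration; the thesis itself works in MATLAB):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)  # sample 1
y = rng.normal(0.5, 1.0, size=150)  # sample 2, shifted mean

# Empirical CDFs of both samples evaluated on the pooled data points,
# and their maximum absolute gap D (equation 2.1).
grid = np.sort(np.concatenate([x, y]))
F1 = np.searchsorted(np.sort(x), grid, side="right") / len(x)
F2 = np.searchsorted(np.sort(y), grid, side="right") / len(y)
D_manual = np.max(np.abs(F1 - F2))

result = ks_2samp(x, y)
print(D_manual, result.statistic, result.pvalue)
```

The manually computed D agrees with SciPy's statistic, since the supremum in equation 2.1 is attained at one of the observed data points.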


Chapter 3

Implementation

This chapter explains the various choices that need to be considered in creating the machine learning models. The first section gives the reasoning behind the choice of implementation environment and language. Section 3.2 explains the segmentation of the dataset. The features extracted are presented in section 3.3. Section 3.4 explains how the models are trained and tested. The second to last section presents the grid search performed. The final section presents how the machine learning models are created and how the optimal models are identified.

3.1 Choice of implementation environment and language

Some requirements were taken into account in the choice of implementation environment and language. It is beneficial to use libraries to reduce development time; hence, the range of machine learning applications available was important. For this reason, MATLAB® was chosen. Its Statistics and Machine Learning Toolbox™ offers a wide variety of classifiers, including the tree classifiers intended to be tested in this project. Furthermore, the toolbox provides a straightforward calculation of all the performance measures. Some feature selection methods are included, but perhaps not sufficiently many. However, several 3rd-party libraries exist, as MATLAB is a very popular choice for performing machine learning tasks. Some care needs to be taken with 3rd-party libraries, though: no official quality check has necessarily been done, so the code might contain bugs and errors. Another reason for the choice is that vectorized operations can be done effortlessly in MATLAB; these will be used extensively since the dataset is represented as a matrix. The wide range of libraries available means that it will be easy to try and test a vast number of models and modifications.

Other options were deliberately ruled out for various reasons. C/C++ is fast but has fewer machine learning libraries. Java has a broad range of machine learning libraries but is not as high-level as Python and MATLAB and would hence require more time to create the models. One of the most popular alternatives to MATLAB is using Python combined with
